Bug 1343235

Summary: [RH Ceph 2.0 / 10.2.1-7.el7cp ] tcmalloc/libc deadlock causes rados thrash ec profile k=2,m=1 osd restart failed/timeout
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasu Kulkarni <vakulkar>
Component: RADOSAssignee: Samuel Just <sjust>
Status: CLOSED ERRATA QA Contact: Vasu Kulkarni <vakulkar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 2.0CC: ceph-eng-bugs, ceph-qe-bugs, dzafman, gmeno, hnallurv, kchai, kdreyer, sjust, vakulkar
Target Milestone: rc   
Target Release: 2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: gperftools-2.4-8.el7 ceph-10.2.2-19.el7cp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-23 19:40:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vasu Kulkarni 2016-06-06 21:43:02 UTC
Description of problem:

Test: rados/thrash-erasure-code-isa/{rados.yaml arch/x86_64.yaml clusters/{fixed-2.yaml
    openstack.yaml} fs/xfs.yaml msgr-failures/fastclose.yaml thrashers/mapgap.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}

Seen in 2 of the ec profile tests, atleast one osd failed to respond to restart during thrash

2016-06-04T12:26:11.524 INFO:tasks.rados.rados.0.clara006.stdout:1776: read oid 8 snap -1
2016-06-04T12:26:11.525 INFO:tasks.rados.rados.0.clara006.stdout:1776:  expect deleted
2016-06-04T12:26:11.526 INFO:tasks.rados.rados.0.clara006.stdout:append oid 44 current snap is 166
2016-06-04T12:26:11.526 INFO:tasks.rados.rados.0.clara006.stdout:1777:  seq_num 556 ranges {0=516096,516096=520192,1036288=634880,1671168=450560}
2016-06-04T12:26:11.527 INFO:tasks.rados.rados.0.clara006.stdout:1777:  writing clara006430203-44 from 0 to 516096 tid 1
2016-06-04T12:26:11.528 INFO:tasks.rados.rados.0.clara006.stdout:1777:  writing clara006430203-44 from 516096 to 1036288 tid 2
2016-06-04T12:26:11.529 INFO:tasks.rados.rados.0.clara006.stdout:1777:  writing clara006430203-44 from 1036288 to 1671168 tid 3
2016-06-04T12:26:11.530 INFO:tasks.rados.rados.0.clara006.stdout:1777:  writing clara006430203-44 from 1671168 to 2121728 tid 4
2016-06-04T12:26:11.531 INFO:tasks.rados.rados.0.clara006.stdout:1778: read oid 33 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo snap -1
2016-06-04T12:26:11.532 INFO:tasks.rados.rados.0.clara006.stdout:1778:  expect (ObjNum 250 snap 56 seq_num 250)
2016-06-04T12:26:11.532 INFO:tasks.rados.rados.0.clara006.stdout:1779: snap_remove snap 158
2016-06-04T12:26:12.020 INFO:teuthology.orchestra.run.clara006:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight'
2016-06-04T12:26:12.188 INFO:teuthology.orchestra.run.clara006.stderr:no valid command found; 10 closest matches:
2016-06-04T12:26:12.190 INFO:teuthology.orchestra.run.clara006.stderr:config set <var> <val> [<val>...]
2016-06-04T12:26:12.191 INFO:teuthology.orchestra.run.clara006.stderr:version
2016-06-04T12:26:12.192 INFO:teuthology.orchestra.run.clara006.stderr:perfcounters_schema
2016-06-04T12:26:12.193 INFO:teuthology.orchestra.run.clara006.stderr:git_version
2016-06-04T12:26:12.194 INFO:teuthology.orchestra.run.clara006.stderr:help
2016-06-04T12:26:12.195 INFO:teuthology.orchestra.run.clara006.stderr:config show
2016-06-04T12:26:12.196 INFO:teuthology.orchestra.run.clara006.stderr:get_command_descriptions
2016-06-04T12:26:12.197 INFO:teuthology.orchestra.run.clara006.stderr:config get <var>
2016-06-04T12:26:12.198 INFO:teuthology.orchestra.run.clara006.stderr:perf schema
2016-06-04T12:26:12.199 INFO:teuthology.orchestra.run.clara006.stderr:2
2016-06-04T12:26:12.200 INFO:teuthology.orchestra.run.clara006.stderr:admin_socket: invalid command
2016-06-04T12:26:12.203 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 639, in wrapper
    return func(self)
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 683, in do_thrash
    self.test_map_discontinuity()
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 568, in test_map_discontinuity
    self.revive_osd()
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 351, in revive_osd
    skip_admin_check=skip_admin_check)
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 1812, in revive_osd
    timeout=timeout)
  File "/home/teuthworker/src/ceph-qa-suite_rh21/tasks/ceph_manager.py", line 1066, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.1 restart


Version-Release number of selected component (if applicable):
Ceph version: 10.2.1-7.el7cp 


How reproducible:
2/2

Steps to Reproduce:
Run rados thrash with ec profile


Additional info:

http://magna002.ceph.redhat.com/vasu-2016-05-31_15:37:42-rados-jewel---basic-clara/237745/teuthology.log

another ec test:

http://magna002.ceph.redhat.com/vasu-2016-05-31_15:37:42-rados-jewel---basic-clara/237686/teuthology.log

Comment 2 Samuel Just 2016-06-09 15:03:19 UTC
In both cases, it looks like the osd started for a bit and then the process ended or hung, no evidence of a crash.

I also notice that in both cases, it looks like the class loading thread is hung.  That probably makes this a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1334636, except that the workaround won't help here.

Comment 3 Samuel Just 2016-06-09 15:04:46 UTC
Vasu: can you confirm by confirming that
1) while stuck like this the process is running and
2) attaching gdb while stuck like this and getting a dump of all threads

You'll probably have to increase the timeout to give yourself time to get to the process.

Comment 4 Vasu Kulkarni 2016-06-09 21:20:59 UTC
A rerun of the tests has actually passed, i am going to retry few more times and try to attach gdb when the process is stuck

rerun logs:
http://pulpito.ceph.redhat.com/vasu-2016-06-09_13:43:05-rados-jewel---basic-clara/

Comment 5 Samuel Just 2016-06-10 18:24:53 UTC
The plan is for vasu to do a bunch more runs over the weekend so we can get an idea of the frequency, and if it recurs, hopefully get the gdb backtraces so we can confirm.

Unlike with the hung ceph-disk version of this bug, we probably don't have a good workaround (although I guess we could hack up the systemd start script to automatically restart the osd process if the admin socket doesn't appear after a set amount of time?).  The "fix" would be to ship a version of tcmalloc without this problem.

Marking needs-info on vasu until we get more frequency information.

Comment 6 Vasu Kulkarni 2016-06-13 16:37:55 UTC
Over the weekend I think I hit this 3 times in around 60+ runs, so the frequency is quite good enough to get the logs, I think the image is stripped of debug symbols, but regardless the backtrace is 


(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_replica_log.so
Reading symbols from /usr/lib64/rados-classes/libcls_log.so...Reading symbols from /usr/lib64/rados-classes/libcls_log.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_log.so
Reading symbols from /usr/lib64/rados-classes/libcls_cephfs.so...Reading symbols from /usr/lib64/rados-classes/libcls_cephfs.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_cephfs.so
Reading symbols from /usr/lib64/rados-classes/libcls_rbd.so...Reading symbols from /usr/lib64/rados-classes/libcls_rbd.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_rbd.so
Reading symbols from /usr/lib64/rados-classes/libcls_timeindex.so...Reading symbols from /usr/lib64/rados-classes/libcls_timeindex.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_timeindex.so
Reading symbols from /usr/lib64/rados-classes/libcls_version.so...Reading symbols from /usr/lib64/rados-classes/libcls_version.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_version.so
Reading symbols from /usr/lib64/rados-classes/libcls_lock.so...Reading symbols from /usr/lib64/rados-classes/libcls_lock.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_lock.so
Reading symbols from /usr/lib64/rados-classes/libcls_hello.so...Reading symbols from /usr/lib64/rados-classes/libcls_hello.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_hello.so
Reading symbols from /usr/lib64/rados-classes/libcls_statelog.so...Reading symbols from /usr/lib64/rados-classes/libcls_statelog.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_statelog.so
Reading symbols from /usr/lib64/rados-classes/libcls_user.so...Reading symbols from /usr/lib64/rados-classes/libcls_user.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_user.so
Reading symbols from /usr/lib64/rados-classes/libcls_journal.so...Reading symbols from /usr/lib64/rados-classes/libcls_journal.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/rados-classes/libcls_journal.so
0x00007f1ed6b02573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
Missing separate debuginfos, use: debuginfo-install ceph-osd-10.2.1-7.el7cp.x86_64
(gdb) thread apply all bt

Thread 28 (Thread 0x7f1ed29ac700 (LWP 1016329)):
#0  0x00007f1ed58ab6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1ed78bd20b in ceph::log::Log::entry() ()
#2  0x00007f1ed58a7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1ed3f32ced in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x7f1ed1a38700 (LWP 1016330)):
#0  0x00007f1ed6b02573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1  0x00007f1ed6b02447 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2  0x00007f1ed6af6958 in tcmalloc::ThreadCache::CreateCacheIfNecessary() () from /lib64/libtcmalloc.so.4
#3  0x00007f1ed6ae4f15 in (anonymous namespace)::do_malloc_no_errno(unsigned long) [clone .part.15] () from /lib64/libtcmalloc.so.4
#4  0x00007f1ed6b067b0 in tc_new () from /lib64/libtcmalloc.so.4
#5  0x00007f1ed78bc6cd in ceph::log::Log::create_entry(int, int, unsigned long*) ()
#6  0x00007f1ed78b4a0f in ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long) ()
#7  0x00007f1ed78b544e in ceph::HeartbeatMap::is_healthy() ()
#8  0x00007f1ed78b5c2c in ceph::HeartbeatMap::check_touch_file() ()
#9  0x00007f1ed79903cb in CephContextServiceThread::entry() ()
#10 0x00007f1ed58a7dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f1ed3f32ced in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7f1ed1237700 (LWP 1016331)):
#0  0x00007f1ed6b02573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1  0x00007f1ed6b02447 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2  0x00007f1ed6af3078 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#3  0x00007f1ed6af3148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#4  0x00007f1ed6af31dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#5  0x00007f1ed6af6235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#6  0x00007f1ed6b06718 in tc_new () from /lib64/libtcmalloc.so.4
#7  0x00007f1ed79c3cd6 in boost::spirit::classic::rule<boost::spirit::classic::scanner<__gnu_cxx::__normal_iterator<char const*, std::string>, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> >, boost::spirit::classic::nil_t, boost::spirit::classic::nil_t>& boost::spirit::classic::rule<boost::spirit::classic::scanner<__gnu_cxx::__normal_iterator<char const*, std::string>, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> >, boost::spirit::classic::nil_t, boost::spirit::classic::nil_t>::operator=<boost::spirit::classic::sequence<boost::spirit::classic::rule<boost::spirit::classic::scanner<__gnu_cxx::__normal_iterator<char const*, std::string>, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> >, boost::spirit::classic::nil_t, boost::spirit::classic::nil_t>, boost::spirit::classic::kleene_star<boost::spirit::classic::alternative<boost::spirit::classic::sequence<boost::spirit::classic::chlit<char>, boost::spirit::classic::rule<boost::spirit::classic::scanner<__gnu_cxx::__normal_iterator<char const*, std::string>, boost::spirit::classic::scanner_policies<boost::spirit::classic::skipper_iteration_policy<boost::spirit::classic::iteration_policy>, boost::spirit::classic::match_policy, boost::spirit::classic::action_policy> >, boost::spirit::classic::nil_t, boost::spirit::classic::nil_t> >, boost::spirit::classic::chlit<char> > > > >(boost::spirit::classic::sequence<boost::spirit::classic::rule<boost::spiri---Type <return> to continue, or q <return> to quit---



more logs:

http://magna002.ceph.redhat.com/vasu-2016-06-11_14:40:22-rados-jewel---basic-clara/238562/teuthology.log


[ubuntu@clara010 ~]$ sudo ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 6.00000 root default
-3 6.00000 rack localrack
-2 6.00000 host localhost
0 1.00000 osd.0 up 1.00000 0.33659
1 1.00000 osd.1 up 1.00000 1.00000
2 1.00000 osd.2 up 1.00000 1.00000
3 1.00000 osd.3 down 1.00000 0.88971
4 1.00000 osd.4 up 1.00000 0
5 1.00000 osd.5 up 1.00000 1.00000

Comment 7 Samuel Just 2016-06-13 17:14:33 UTC
This only includes the first 3 threads...  Can you include all of them?  At the bottom, it mentions that you need to hit <return> to continue to the next page.

It also indicates how to get the debug symbols:
Missing separate debuginfos, use: debuginfo-install ceph-osd-10.2.1-7.el7cp.x86_64

Threads 27 and 28 linked above, however, do appear to be hung on a tcmalloc lock, which would be consistent with the tcmalloc bug.

Vasu: to confirm, please reproduce again and include *all* of the thread backtraces.

Comment 8 Christina Meno 2016-06-13 17:20:04 UTC
I think that this data suggests we can responsibly exclude this issue from blocking 2.0GA

Comment 9 Samuel Just 2016-06-13 17:30:01 UTC
Linking the tcmalloc fixup bz.

Comment 10 Vasu Kulkarni 2016-06-13 17:36:24 UTC
From clara003 for OSD.4 ( This is a different node )

#8  0x00007f5640b4b8ab in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /lib64/libstdc++.so.6
#9  0x00007f5640b4b954 in std::string::reserve(unsigned long) () from /lib64/libstdc++.so.6
#10 0x00007f5640b4bbbf in std::string::append(char const*, unsigned long) () from /lib64/libstdc++.so.6
#11 0x00007f5640b03864 in std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /lib64/libstdc++.so.6
#12 0x00007f5643d0bce4 in dump_cmd_to_json(ceph::Formatter*, std::string const&) ()
#13 0x00007f5643d0cb4c in dump_cmd_and_help_to_json(ceph::Formatter*, std::string const&, std::string const&, std::string const&) ()
#14 0x00007f5643cead04 in GetdescsHook::call(std::string, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&, std::string, ceph::buffer::list&) ()
#15 0x00007f5643ce2f91 in AdminSocket::do_accept() ()
#16 0x00007f5643ce4948 in AdminSocket::entry() ()
#17 0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#18 0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f563c241700 (LWP 361636)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5640b3e9ec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007f56436d1a09 in ceph::timer_detail::timer<ceph::time_detail::mono_clock>::timer_thread() ()
#3  0x00007f5640b42220 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f563ba40700 (LWP 361637)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7f563b23f700 (LWP 361638)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7f563aa3e700 (LWP 361639)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7f563a23d700 (LWP 361640)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7f5639a3c700 (LWP 361641)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7f563923b700 (LWP 361642)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cc609f in SimpleMessenger::reaper_entry() ()
#2  0x00007f5643ccb9dd in SimpleMessenger::ReaperThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7f5638a3a700 (LWP 361643)):
#0  0x00007f5641c24f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5641c20d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f5641c20c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f5643c9cff8 in Mutex::Lock(bool) ()
#4  0x00007f5643cd8152 in SafeTimer::timer_thread() ()
#5  0x00007f5643cd9c5d in SafeTimerThread::entry() ()
#6  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f5638239700 (LWP 361644)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cd861f in SafeTimer::timer_thread() ()
#2  0x00007f5643cd9c5d in SafeTimerThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f5637a38700 (LWP 361645)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cd861f in SafeTimer::timer_thread() ()
#2  0x00007f5643cd9c5d in SafeTimerThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f5637237700 (LWP 361646)):
#0  0x00007f5641c24f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5641c20d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f5641c20c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#3  0x00007f56430ba029 in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4  0x00007f5642e79849 in GetStackTrace_libunwind(void**, int, int) () from /lib64/libtcmalloc.so.4
#5  0x00007f5642e7a0be in GetStackTrace(void**, int, int) () from /lib64/libtcmalloc.so.4
#6  0x00007f5642e6b314 in tcmalloc::PageHeap::GrowHeap(unsigned long) () from /lib64/libtcmalloc.so.4
#7  0x00007f5642e6b633 in tcmalloc::PageHeap::New(unsigned long) () from /lib64/libtcmalloc.so.4
#8  0x00007f5642e69f64 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#9  0x00007f5642e6a148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#10 0x00007f5642e6a1dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#11 0x00007f5642e6d235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#12 0x00007f5642e7df58 in tc_newarray () from /lib64/libtcmalloc.so.4
#13 0x00007f56425d6878 in snappy::Compress(snappy::Source*, snappy::Sink*) () from /lib64/libsnappy.so.1
#14 0x00007f56425d7271 in snappy::RawCompress(char const*, unsigned long, char*, unsigned long*) () from /lib64/libsnappy.so.1
#15 0x00007f56428158d4 in leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*) () from /lib64/libleveldb.so.1
#16 0x00007f5642815a64 in leveldb::TableBuilder::Flush() () from /lib64/libleveldb.so.1
#17 0x00007f5642815bfb in leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&) () from /lib64/libleveldb.so.1
#18 0x00007f56427f710b in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /lib64/libleveldb.so.1
#19 0x00007f56427f78c3 in leveldb::DBImpl::BackgroundCompaction() () from /lib64/libleveldb.so.1
#20 0x00007f56427f84b0 in leveldb::DBImpl::BackgroundCall() () from /lib64/libleveldb.so.1
#21 0x00007f56428195cb in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /lib64/libleveldb.so.1
#22 0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f56365a8700 (LWP 361647)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f56439f0aff in WBThrottle::get_next_should_flush(boost::tuples::tuple<ghobject_t, std::shared_ptr<FDCache::FD>, WBThrottle::PendingWB, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>*) ()
#2  0x00007f56439f1182 in WBThrottle::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f563505a700 (LWP 361648)):
#0  0x00007f5642e79573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1  0x00007f5642e79447 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2  0x00007f5642e6a078 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#3  0x00007f5642e6a148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
#4  0x00007f5642e6a1dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#5  0x00007f5642e6d235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#6  0x00007f5642e7d718 in tc_new () from /lib64/libtcmalloc.so.4
#7  0x00007f5643c336cd in ceph::log::Log::create_entry(int, int, unsigned long*) ()
#8  0x00007f56439b042d in FileStore::sync_entry() ()
---Type <return> to continue, or q <return> to quit---
#9  0x00007f56439d977d in FileStore::SyncThread::entry() ()
#10 0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f5634859700 (LWP 361649)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f564369b95e in Cond::Wait(Mutex&) ()
#2  0x00007f5643a96604 in FileJournal::write_thread_entry() ()
#3  0x00007f56439d18dd in FileJournal::Writer::entry() ()
#4  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f5634058700 (LWP 361650)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f5633857700 (LWP 361651)):
#0  0x00007f5641c22a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643ce02fa in ThreadPool::worker(ThreadPool::WorkThread*) ()
#2  0x00007f5643ce1590 in ThreadPool::WorkThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f5633056700 (LWP 361652)):
#0  0x00007f5641c22a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643ce02fa in ThreadPool::worker(ThreadPool::WorkThread*) ()
#2  0x00007f5643ce1590 in ThreadPool::WorkThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f5632855700 (LWP 361653)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f5632054700 (LWP 361654)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f5631853700 (LWP 361655)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f5631052700 (LWP 361656)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f5630851700 (LWP 361657)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5630050700 (LWP 361658)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643c19e38 in Finisher::finisher_thread_entry() ()
#2  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f562f84f700 (LWP 361659)):
#0  0x00007f5641c226d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5643cd861f in SafeTimer::timer_thread() ()
#2  0x00007f5643cd9c5d in SafeTimerThread::entry() ()
#3  0x00007f5641c1edc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f56402a9ced in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f56432be800 (LWP 361628)):
#0  0x00007f5642e79573 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib64/libtcmalloc.so.4
#1  0x00007f5642e79447 in SpinLock::SlowLock() () from /lib64/libtcmalloc.so.4
#2  0x00007f5642e6a078 in tcmalloc::CentralFreeList::Populate() () from /lib64/libtcmalloc.so.4
#3  0x00007f5642e6a148 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib64/libtcmalloc.so.4
---Type <return> to continue, or q <return> to quit---
#4  0x00007f5642e6a1dd in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib64/libtcmalloc.so.4
#5  0x00007f5642e6d235 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) () from /lib64/libtcmalloc.so.4
#6  0x00007f5642e7d718 in tc_new () from /lib64/libtcmalloc.so.4
#7  0x00007f5640b4ac69 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#8  0x00007f5640b4c521 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) ()
   from /lib64/libstdc++.so.6
#9  0x00007f5640b4c958 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) ()
   from /lib64/libstdc++.so.6
#10 0x00007f562ec0e1ea in ?? () from /usr/lib64/rados-classes/libcls_rbd.so
#11 0x00007f56430c83a3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007f56430ccab6 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007f56430c81b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007f56430cc1ab in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007f5641a1402b in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007f56430c81b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007f5641a1462d in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007f5641a140c1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#19 0x00007f56436eb891 in ClassHandler::_load_class(ClassHandler::ClassData*) ()
#20 0x00007f56436ebc9f in ClassHandler::open_class(std::string const&, ClassHandler::ClassData**) ()
#21 0x00007f56436ebf79 in ClassHandler::open_all_classes() ()
#22 0x00007f5643681dbc in OSD::init() ()
#23 0x00007f56435e5c15 in main ()
(gdb) 
(gdb)

Comment 11 Samuel Just 2016-06-13 17:48:31 UTC
t1:
dlopen
CentralFreeList::Populate()

t15:
CentralFreeList::Populate()
tls_get_addr_tail()

So yeah, this appears to be exactly the same as the linked bugs.  The bug is in tcmalloc and the right fix would be to ship a patched tcmalloc (detailed in the upstream bug).

Comment 21 Vasu Kulkarni 2016-07-19 19:07:51 UTC
Verified in 10.2.2-21 and 22.

Comment 23 errata-xmlrpc 2016-08-23 19:40:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html