Created attachment 1022666 [details]
All logs: osd, mon

Description of problem:
A cluster with 3 monitors was created; all the monitors crashed after multiple crush map edits.

Version-Release number of selected component (if applicable):
[root@hp-ms-01-c05 home]# rpm -qa | grep ceph
ceph-common-0.94.1-5.el7cp.x86_64
ceph-mon-0.94.1-5.el7cp.x86_64
ceph-0.94.1-5.el7cp.x86_64
ceph-osd-0.94.1-5.el7cp.x86_64

How reproducible:
Tried only once

Steps to Reproduce:
1. Created a cluster with 3 monitors and 5 OSDs.
2. Started doing rados put operations.
3. To induce misplaced PGs, edited the crush map by changing some values in the rules section of the crush map.
4. After many edits and bringing OSDs down, out, and in, the monitors suddenly became unresponsive.

Actual results:
All the monitors crashed.

Expected results:

Additional info:

Backtrace
=========
--- begin dump of recent events ---
     0> 2015-05-06 07:28:28.359165 7fad29cf5700 -1 *** Caught signal (Aborted) **
 in thread 7fad29cf5700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mon() [0x9017e2]
 2: (()+0xf130) [0x7fad3005b130]
 3: (gsignal()+0x37) [0x7fad2ea755d7]
 4: (abort()+0x148) [0x7fad2ea76cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fad2f3799b5]
 6: (()+0x5e926) [0x7fad2f377926]
 7: (()+0x5e953) [0x7fad2f377953]
 8: (()+0x5eb73) [0x7fad2f377b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7b361a]
 10: (PGMap::get_filtered_pg_stats(std::string&, long, long, bool, std::set<pg_t, std::less<pg_t>, std::allocator<pg_t> >&)+0x1d3) [0x885ea3]
 11: (PGMonitor::preprocess_command(MMonCommand*)+0x1ccd) [0x66398d]
 12: (PGMonitor::preprocess_query(PaxosServiceMessage*)+0x27f) [0x66584f]
 13: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cacd3]
 14: (Monitor::handle_command(MMonCommand*)+0x1549) [0x591b19]
 15: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594c89]
 16: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595936]
 17: (Monitor::ms_dispatch(Message*)+0x23) [0x5b5403]
 18: (DispatchQueue::entry()+0x64a) [0x8a1d9a]
 19: (DispatchQueue::DispatchThread::entry()+0xd) [0x79bd9d]
 20: (()+0x7df5) [0x7fad30053df5]
 21: (clone()+0x6d) [0x7fad2eb361ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

No core was generated since ulimit was set to 0 by default.

Cluster status
==============
[root@hp-ms-01-c02 ~]# ceph -s
    cluster f75f054f-b849-4832-9b63-a71cec24bdc6
     health HEALTH_WARN
            8 pgs degraded
            5 pgs stuck degraded
            106 pgs stuck unclean
            5 pgs stuck undersized
            8 pgs undersized
            recovery 1/60 objects degraded (1.667%)
            recovery 2/60 objects misplaced (3.333%)
            too many PGs per OSD (404 > max 300)
     monmap e1: 3 mons at {mon1=10.12.27.2:6789/0,mon2=10.12.27.3:6789/0,mon3=10.12.27.5:6789/0}
            election epoch 8, quorum 0,1,2 mon1,mon2,mon3
     osdmap e160: 6 osds: 5 up, 4 in; 106 remapped pgs
      pgmap v1482: 576 pgs, 2 pools, 18400 bytes data, 20 objects
            40169 MB used, 360 GB / 399 GB avail
            1/60 objects degraded (1.667%)
            2/60 objects misplaced (3.333%)
                 470 active+clean
                  98 active+remapped
                   5 active+undersized+degraded+remapped
                   3 active+undersized+degraded

Edits performed on crush map
============================
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

The above rule was modified to:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 1 type host
        step emit
}

The above change in the crush map made some PGs misplaced; then one of the OSDs was brought down, which made the cluster show degraded + misplaced PGs.
This operation was repeated a couple of times, and then the crash was seen.

Attaching all the logs with the bug.

Crush map
=========
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd1 {
        id -2           # do not change unnecessarily
        # weight 1.100
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.5 weight 0.100
}
host osd2 {
        id -4           # do not change unnecessarily
        # weight 0.100
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 0.100
}
host osd3 {
        id -3           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
host osd4 {
        id -5           # do not change unnecessarily
        # weight 0.100
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 0.100
}
host osd5 {
        id -6           # do not change unnecessarily
        # weight 0.100
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 0.100
}
root default {
        id -1           # do not change unnecessarily
        # weight 2.400
        alg straw
        hash 0  # rjenkins1
        item osd1 weight 1.100
        item osd2 weight 0.100
        item osd3 weight 1.000
        item osd4 weight 0.100
        item osd5 weight 0.100
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
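For reference, the edit cycle described above can be sketched as a short shell sequence. The `ceph osd getcrushmap`/`setcrushmap` and `crushtool` invocations are the standard round-trip for editing a crush map; the filenames are examples, and the `sed` line expresses the specific rule change made here (`firstn 0` → `firstn 1`).

```shell
# Round-trip for editing the crush map (requires a live cluster; shown as context):
#   ceph osd getcrushmap -o crush.bin        # fetch the compiled map from the mons
#   crushtool -d crush.bin -o crush.txt      # decompile to editable text
#   crushtool -c crush.txt -o crush.new      # recompile after editing
#   ceph osd setcrushmap -i crush.new        # inject the edited map back

# The edit itself, applied to the decompiled rule text: change the replica
# fan-out from "firstn 0" (use the pool's full size) to "firstn 1" (a single
# host), which is what made PGs misplaced.
cat > crush.txt <<'EOF'
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
EOF
sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 1 type host/' crush.txt
grep 'chooseleaf firstn' crush.txt
```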
Greg, would you mind taking a look at this one (or re-assigning as appropriate)?
This crash can only have been caused by some tool issuing a "pg ls" request (with specific states!) to the monitor. Was this done explicitly, or perhaps as part of Calamari? It looks like this is in some new features added to support Calamari functionality, and I have no idea how often it's used.

I have created an upstream bug to fix what I think is the actual issue: http://tracker.ceph.com/issues/11569

In other news, the crush change you made is fairly nonsensical, since it reduces every PG's mapping to size one (although the OSDs will maintain the previous mappings in pg_temp in order to keep the replica counts the pools requested). Not sure if that's deliberate or not.

Assigning this to Kefu to dig into further, since we're trying to bring him up to speed on the monitors.
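To illustrate why a single "pg ls" query could take down every monitor, here is a toy Python model, not Ceph source, of the general failure pattern: a command handler that validates client-supplied input with an assertion. The state names and the exact condition are illustrative (the real failing check lives in `PGMap::get_filtered_pg_stats`, frame 10 of the backtrace, and may differ); the point is that an assert in the dispatch path aborts the whole daemon, and every mon that processes the same query dies the same way.

```python
# Toy model (not Ceph code) of validating untrusted input with an assertion.
KNOWN_STATES = {"active", "clean", "degraded", "remapped", "undersized"}

ALL_PGS = [
    {"id": "1.0", "states": {"active", "clean"}},
    {"id": "1.1", "states": {"active", "remapped"}},
]

def get_filtered_pg_stats_buggy(state):
    # Buggy pattern: assert on a client-supplied filter. In a C++ daemon,
    # ceph_assert aborts the entire process, so the monitor crashes rather
    # than returning an error to the client.
    assert state in KNOWN_STATES, f"unknown state {state!r}"
    return [pg for pg in ALL_PGS if state in pg["states"]]

def get_filtered_pg_stats_fixed(state):
    # Fixed pattern: reject a bad filter with an error code (-EINVAL here)
    # and keep the daemon alive.
    if state not in KNOWN_STATES:
        return -22, []
    return 0, [pg for pg in ALL_PGS if state in pg["states"]]
```

Run against a valid filter both behave the same; against an unexpected one, the buggy version raises (aborts) while the fixed version returns an error.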
(In reply to Greg Farnum from comment #3)
> This crash can only have been caused by some tool issuing a request to "pg
> ls" (with specific states!) on the monitor. Was this done explicitly, or
> perhaps as part of Calamari? It looks like this is in some new features

This operation was done explicitly, i.e.:
1. Get the crush map.
2. Edit the map.
3. Put it back into the cluster.
4. Check ceph -s and ceph pg dump.

I am not using Calamari.

> added to support Calamari functionality and I have no idea how often it's
> used.
>
> I have created an upstream bug to fix what I think is the actual issue:
> http://tracker.ceph.com/issues/11569
>
> In other news, the crush change you made is fairly nonsensical, since it's
> reducing every PG's mapping to size one (although the OSD will maintain the
> previous mappings in pg_temp in order to keep the sizes requested as
> replicated). Not sure if that's deliberate or not.

This was done deliberately, to create misplaced PGs, which is a necessary condition for the test I am doing.

> Assigning this to Kefu to dig into further since we're trying to bring him
> up on the monitors.
Looks like the fix is still undergoing review upstream for master (https://github.com/ceph/ceph/pull/4643).

From what I understand, this crash is pretty rare, right? Based on that assumption I'm going to un-target this bugfix from the 1.3.0 release.
> From what I understand, this crash is pretty rare, right?

Ken, as long as the user does not send "pg ls* recovery" to the ceph CLI, we are good.

> Based on that assumption I'm going to un-target this bugfix from the 1.3.0 release

Thank you!
Pending backport: http://tracker.ceph.com/issues/11910
Thanks Kefu! We should be able to pull those patches in downstream in time for 1.3.1.
Patches to pull downstream: https://github.com/ceph/ceph/pull/5160/commits
Verified on:

rpm -qa | grep ceph
ceph-mon-0.94.3-1.el7cp.x86_64
ceph-common-0.94.3-1.el7cp.x86_64
ceph-0.94.3-1.el7cp.x86_64

Even after many crush map edits I don't see any mon crash, hence marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2512
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2066