Description of problem:
I was testing the balancer mgr module. It ran fine in automatic mode (ceph balancer on) and made changes to pg mappings as expected. I left it over the weekend and came back to find the mgr daemon had crashed. I have since tried to restart it, but it just crashes again. The crash messages are shown below:

    -3> 2018-06-20 12:58:38.408927 7fc69bd27700 10 trying 11.11
    -2> 2018-06-20 12:58:38.408975 7fc69bd27700 10 11.11 [24,25,0] -> [31,-823648512,22064]
    -1> 2018-06-20 12:58:38.408982 7fc69bd27700 10 11.11 pg_upmap_items [24,31,25,-823648512,0,22064]
     0> 2018-06-20 12:58:38.411079 7fc69bd27700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc69bd27700 thread_name:balancer

 ceph version 12.2.4-10.el7cp (03fd19535b3701f3322c68b5f424335d6fc8dd66) luminous (stable)
 1: (()+0x3eeb51) [0x5630c476bb51]
 2: (()+0xf680) [0x7fc6b4fe8680]
 3: (OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+0x17d) [0x5630c4885bfd]
 4: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*, bool) const+0x1c0) [0x5630c48995a0]
 5: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2d9) [0x5630c489a7e9]
 6: (()+0x2e3eea) [0x5630c4660eea]
 7: (PyEval_EvalFrameEx()+0x6df0) [0x7fc6b6f4ecf0]
 8: (PyEval_EvalCodeEx()+0x7ed) [0x7fc6b6f5103d]
 9: (PyEval_EvalFrameEx()+0x663c) [0x7fc6b6f4e53c]
 10: (PyEval_EvalFrameEx()+0x67bd) [0x7fc6b6f4e6bd]
 11: (PyEval_EvalFrameEx()+0x67bd) [0x7fc6b6f4e6bd]
 12: (PyEval_EvalCodeEx()+0x7ed) [0x7fc6b6f5103d]
 13: (()+0x70978) [0x7fc6b6eda978]
 14: (PyObject_Call()+0x43) [0x7fc6b6eb5a63]
 15: (()+0x5aa55) [0x7fc6b6ec4a55]
 16: (PyObject_Call()+0x43) [0x7fc6b6eb5a63]
 17: (()+0x4bb45) [0x7fc6b6eb5b45]
 18: (PyObject_CallMethod()+0xbb) [0x7fc6b6eb5e7b]
 19: (PyModuleRunner::serve()+0x5c) [0x5630c465eadc]
 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x6f) [0x5630c465f15f]
 21: (()+0x7dd5) [0x7fc6b4fe0dd5]
 22: (clone()+0x6d) [0x7fc6b40bcb3d]

Version-Release number of selected component (if applicable):
Ceph 3.0

How reproducible:
100% reproducible

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
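A possible manual workaround sketch, assuming the mons are still reachable (the pg id 11.11 is taken from the log above):

# list the current pg_upmap_items entries in the osdmap
ceph osd dump | grep pg_upmap_items
# drop the offending entry for pg 11.11
ceph osd rm-pg-upmap-items 11.11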
Using a binary-identical environment...

# gdb -q /usr/bin/ceph-osd
Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd.debug...done.
done.
(gdb) p &OSDMap::_apply_upmap
$1 = (void (OSDMap::*)(const OSDMap * const, const pg_pool_t &, pg_t, std::vector<int, std::allocator<int> > *)) 0xb36330 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const>

So that's the address of our function, 0xb36330. Now we can find the exact point where we crashed by adding 0x17d (the offset into the function given in frame 3 of the backtrace above) to 0xb36330.

(gdb) p/x 0xb36330+0x17d
$2 = 0xb364ad

Also note that the decimal value of 0x17d is 381.

(gdb) p/d 0x17d
$4 = 381

This can also be used to find the correct instruction by its offset (+381). The following command disassembles the function containing the address 0xb364ad and interleaves the source code.

(gdb) disass /m 0xb364ad
...
1979	          pos < 0 &&
   0x0000000000b36492 <+354>:	cmp    %edx,%edi
   0x0000000000b36494 <+356>:	jne    0xb36480 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+336>
   0x0000000000b3649c <+364>:	cmp    $0x7fffffff,%esi
   0x0000000000b364a2 <+370>:	je     0xb364b5 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+389>

1980	          !(r.second != CRUSH_ITEM_NONE && r.second < max_osd &&
   0x0000000000b364a4 <+372>:	cmp    %esi,0x38(%rbx)
   0x0000000000b364a7 <+375>:	jle    0xb364b5 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+389>

1981	            osd_weight[r.second] == 0)) {
   0x0000000000b36467 <+311>:	movslq %esi,%rax
   0x0000000000b3646a <+314>:	mov    (%r10),%edi
   0x0000000000b364a9 <+377>:	mov    0x78(%rbx),%rdx
   0x0000000000b364ad <+381>:	mov    (%rdx,%r12,1),%edx    <---- HERE
   0x0000000000b364b1 <+385>:	test   %edx,%edx
   0x0000000000b364b3 <+387>:	je     0xb36480 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+336>

(gdb) l 1981
1976	      }
1977	      // ignore mapping if target is marked out (or invalid osd id)
1978	      if (osd == r.first &&
1979	          pos < 0 &&
1980	          !(r.second != CRUSH_ITEM_NONE && r.second < max_osd &&
1981	            osd_weight[r.second] == 0)) {    <---- HERE
1982	        pos = i;
1983	      }
1984	    }
1985	    if (!exists && pos >= 0) {

So to me this looks like we are indexing so far outside the bounds of the osd_weight array that we touched unmapped memory and segfaulted; the -823648512 value looks like the culprit. If we had a coredump we could verify that.

The naive solution might be to check whether r.second < 0 and ignore the mapping if it is, but I want to look further into why/how this came about and the best way to solve it going forward. That will involve tracing where that -823648512 value is coming from. If we can get a coredump that might help, especially if the customer can recreate this easily. I'll continue with this tomorrow morning.
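To illustrate the failure mode in isolation, here is a minimal standalone sketch (not Ceph code; the vector length and the bogus id are stand-ins taken from the log above):

#include <cstdio>
#include <vector>

int main() {
  // stand-in for OSDMap::osd_weight
  std::vector<int> osd_weight(32, 0x10000);
  int r_second = -823648512;  // the bogus target osd id from the log
  // vector::operator[] does no bounds checking: the negative index is
  // converted to a huge size_t, the computed address ends up roughly
  // 3 GiB before the buffer, and the read almost certainly faults.
  std::printf("%d\n", osd_weight[r_second]);
  return 0;
}

Note that the existing guard on line 1980 only checks r.second < max_osd; a negative id passes that comparison, which is why the index is never rejected before the read.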
I have a solution for the segfault going into master (details in the upstream tracker). I'll create a separate bug for the python balancer code sending negative values to the mgr, but with this fix in place those values will be ignored.
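For reference, a hedged sketch of the shape of the check (the actual upstream patch may differ; CRUSH_ITEM_NONE is 0x7fffffff, matching the cmp $0x7fffffff,%esi in the disassembly above):

#include <cstdint>

// Hypothetical helper, not the upstream patch: a pg_upmap_items target
// may only be used to index osd_weight if it is a valid osd id.
static const int32_t CRUSH_ITEM_NONE = 0x7fffffff;  // as in crush.h

static bool upmap_target_valid(int32_t id, int32_t max_osd) {
  return id == CRUSH_ITEM_NONE || (id >= 0 && id < max_osd);
}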
Hi Brad,

This would be really, really important to get working in RHCS 3 for large clusters; I'm having a similar problem with it. This was RHOSP (OpenStack) 13 GA, which is an LTS and therefore widely used. If you get the fix into RHCS 3 it should then make its way into RHOSP 13 via the Ceph container images.

What I saw is that I would enable the balancer module, try to run it, and it would no longer be enabled:

[root@overcloud-controller-2 ~]# ceph mgr module enable balancer
[root@overcloud-controller-2 ~]# ceph balancer eval
Error EINVAL: No handler found for 'balancer eval'
[root@overcloud-controller-2 ~]# ceph mgr module ls
{
    "enabled_modules": [],
    "disabled_modules": [
        "balancer",
        "dashboard",
        "influx",
        "localpool",
        "prometheus",
        "restful",
        "selftest",
        "status",
        "zabbix"
    ]
}
[root@overcloud-controller-2 ~]# rpm -qa | grep ceph
...
ceph-common-12.2.4-10.el7cp.x86_64

For a large cluster, the regular PG distribution across OSDs can lead to very inefficient operation, where a couple of OSDs run with 20-30% more load and slow down the entire cluster just because they have more PGs than everyone else. To some extent this can be ameliorated by "ceph osd reweight-by-utilization", but I was looking forward to having this tool to deal with it, particularly in upmap mode.
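For anyone following along, the usual sequence for running the balancer in upmap mode once a fixed build is available (upmap requires all clients to be luminous or newer):

ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on
ceph balancer status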
(In reply to Ben England from comment #8)
> Hi Brad,
>
> This would be really, really important to get working in RHCS 3 for large
> clusters; I'm having a similar problem with it. This was RHOSP (OpenStack)
> 13 GA, which is an LTS and therefore widely used. If you get the fix into
> RHCS 3 it should then make its way into RHOSP 13 via the Ceph container
> images.
>
> What I saw is that I would enable the balancer module, try to run it, and it
> would no longer be enabled:
>
> [root@overcloud-controller-2 ~]# ceph mgr module enable balancer
> [root@overcloud-controller-2 ~]# ceph balancer eval
> Error EINVAL: No handler found for 'balancer eval'
> [root@overcloud-controller-2 ~]# ceph mgr module ls
> {
>     "enabled_modules": [],
>     "disabled_modules": [
>         "balancer",
>         "dashboard",
>         "influx",
>         "localpool",
>         "prometheus",
>         "restful",
>         "selftest",
>         "status",
>         "zabbix"
>     ]
> }
> [root@overcloud-controller-2 ~]# rpm -qa | grep ceph
> ...
> ceph-common-12.2.4-10.el7cp.x86_64
>
> For a large cluster, the regular PG distribution across OSDs can lead to
> very inefficient operation, where a couple of OSDs run with 20-30% more
> load and slow down the entire cluster just because they have more PGs than
> everyone else. To some extent this can be ameliorated by "ceph osd
> reweight-by-utilization", but I was looking forward to having this tool to
> deal with it, particularly in upmap mode.

Hi Ben,

https://bugzilla.redhat.com/show_bug.cgi?id=1612623 is the actual issue; this segfault won't occur if that is resolved. Perhaps an adjustment of the priority/severity of that bug is in order?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0911