Bug 1593110 - Ceph mgr daemon crashing after starting balancer module in automatic mode
Summary: Ceph mgr daemon crashing after starting balancer module in automatic mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: z2
Target Release: 3.2
Assignee: Brad Hubbard
QA Contact: Manohar Murthy
Docs Contact: John Brier
URL:
Whiteboard:
Depends On:
Blocks: 1629656
 
Reported: 2018-06-20 05:30 UTC by liuwei
Modified: 2019-04-30 16:18 UTC
CC: 12 users

Fixed In Version: RHEL: ceph-12.2.8-113.el7cp Ubuntu: ceph_12.2.8-96redhat1xenial
Doc Type: Bug Fix
Doc Text:
.The `ceph-mgr` daemon no longer crashes after starting the balancer module in automatic mode

Previously, due to a CRUSH bug, invalid mappings were created. When the `_apply_upmap` function encountered an invalid mapping, it caused a segmentation fault. With this release, the code checks that the values are within an expected range and ignores invalid values.
Clone Of:
Environment:
Last Closed: 2019-04-30 15:56:43 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:0911 None None None 2019-04-30 15:57:00 UTC
Ceph Project Bug Tracker 22056 None None None 2018-06-20 05:32:45 UTC
Red Hat Bugzilla 1612623 None CLOSED Crush can produce invalid mappings 2019-10-14 10:17:09 UTC

Internal Links: 1612623

Description liuwei 2018-06-20 05:30:02 UTC
Description of problem:

I was testing the auto balancer mgr module, and it seemed to be fine running in automatic mode (ceph balancer on), and making changes to pg mappings as expected. I left it over the weekend and came back to find the mgr daemon had crashed.

I have since tried to restart it, but it just crashes again. The crash messages are shown below:

    -3> 2018-06-20 12:58:38.408927 7fc69bd27700 10   trying 11.11
    -2> 2018-06-20 12:58:38.408975 7fc69bd27700 10   11.11 [24,25,0] -> [31,-823648512,22064]
    -1> 2018-06-20 12:58:38.408982 7fc69bd27700 10   11.11 pg_upmap_items [24,31,25,-823648512,0,22064]
     0> 2018-06-20 12:58:38.411079 7fc69bd27700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc69bd27700 thread_name:balancer

 ceph version 12.2.4-10.el7cp (03fd19535b3701f3322c68b5f424335d6fc8dd66) luminous (stable)
 1: (()+0x3eeb51) [0x5630c476bb51]
 2: (()+0xf680) [0x7fc6b4fe8680]
 3: (OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+0x17d) [0x5630c4885bfd]
 4: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*, bool) const+0x1c0) [0x5630c48995a0]
 5: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2d9) [0x5630c489a7e9]
 6: (()+0x2e3eea) [0x5630c4660eea]
 7: (PyEval_EvalFrameEx()+0x6df0) [0x7fc6b6f4ecf0]
 8: (PyEval_EvalCodeEx()+0x7ed) [0x7fc6b6f5103d]
 9: (PyEval_EvalFrameEx()+0x663c) [0x7fc6b6f4e53c]
 10: (PyEval_EvalFrameEx()+0x67bd) [0x7fc6b6f4e6bd]
 11: (PyEval_EvalFrameEx()+0x67bd) [0x7fc6b6f4e6bd]
 12: (PyEval_EvalCodeEx()+0x7ed) [0x7fc6b6f5103d]
 13: (()+0x70978) [0x7fc6b6eda978]
 14: (PyObject_Call()+0x43) [0x7fc6b6eb5a63]
 15: (()+0x5aa55) [0x7fc6b6ec4a55]
 16: (PyObject_Call()+0x43) [0x7fc6b6eb5a63]
 17: (()+0x4bb45) [0x7fc6b6eb5b45]
 18: (PyObject_CallMethod()+0xbb) [0x7fc6b6eb5e7b]
 19: (PyModuleRunner::serve()+0x5c) [0x5630c465eadc]
 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x6f) [0x5630c465f15f]
 21: (()+0x7dd5) [0x7fc6b4fe0dd5]
 22: (clone()+0x6d) [0x7fc6b40bcb3d]


Version-Release number of selected component (if applicable):

Ceph 3.0 (ceph-12.2.4-10.el7cp)

How reproducible:

100% reproducible
Steps to Reproduce:
1.
2.
3.

Actual results:



Expected results:


Additional info:

Comment 4 Brad Hubbard 2018-06-20 07:58:40 UTC
Using a binary identical environment...

# gdb -q /usr/bin/ceph-osd                                                                                                                                                                                             
Reading symbols from /usr/bin/ceph-osd...Reading symbols from /usr/lib/debug/usr/bin/ceph-osd.debug...done.
done.
(gdb) p &OSDMap::_apply_upmap
$1 = (void (OSDMap::*)(const OSDMap * const, const pg_pool_t &, pg_t, std::vector<int, std::allocator<int> > *)) 0xb36330 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const>

So that's the address of our function, 0xb36330. Now we can find the exact point where we crashed by adding 0x17d (the offset into the function given in frame 3 of the backtrace above) to 0xb36330.

(gdb) p/x 0xb36330+0x17d
$2 = 0xb364ad

Also note that 0x17d is 381 in decimal.

(gdb) p/d 0x17d
$4 = 381

This decimal offset can also be used to find the correct instruction in the disassembly (<+381>).

The following command disassembles the function containing the address 0xb364ad and interleaves the source code.

(gdb) disass /m 0xb364ad
...
1979                pos < 0 &&
   0x0000000000b36492 <+354>:   cmp    %edx,%edi
   0x0000000000b36494 <+356>:   jne    0xb36480 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+336>
   0x0000000000b3649c <+364>:   cmp    $0x7fffffff,%esi
   0x0000000000b364a2 <+370>:   je     0xb364b5 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+389>

1980                !(r.second != CRUSH_ITEM_NONE && r.second < max_osd &&
   0x0000000000b364a4 <+372>:   cmp    %esi,0x38(%rbx)
   0x0000000000b364a7 <+375>:   jle    0xb364b5 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+389>

1981                  osd_weight[r.second] == 0)) {
   0x0000000000b36467 <+311>:   movslq %esi,%rax
   0x0000000000b3646a <+314>:   mov    (%r10),%edi
   0x0000000000b364a9 <+377>:   mov    0x78(%rbx),%rdx
   0x0000000000b364ad <+381>:   mov    (%rdx,%r12,1),%edx   <---- HERE
   0x0000000000b364b1 <+385>:   test   %edx,%edx
   0x0000000000b364b3 <+387>:   je     0xb36480 <OSDMap::_apply_upmap(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*) const+336>

(gdb) l 1981
1976            }
1977            // ignore mapping if target is marked out (or invalid osd id)
1978            if (osd == r.first &&
1979                pos < 0 &&
1980                !(r.second != CRUSH_ITEM_NONE && r.second < max_osd &&
1981                  osd_weight[r.second] == 0)) {    <---- HERE
1982              pos = i;
1983            }
1984          }
1985          if (!exists && pos >= 0) {

So to me it looks like we are indexing so far outside the bounds of the osd_weight array that we touched unmapped memory and segfaulted; the -823648512 value looks like the culprit. If we had a coredump we could verify that.

The naive solution might be to check if r.second < 0 and ignore it if it is but I want to look further into why/how this came about and the best way to solve it going forward. That will involve tracing where that -823648512 value is coming from. If we can get a coredump that might help, especially if the customer can recreate this easily. I'll continue with this tomorrow morning.

Comment 7 Brad Hubbard 2018-08-03 22:53:01 UTC
I have a solution for the segfault going into master (details in upstream tracker). I'll create a separate bug for the python balancer code sending negative values to the mgr but, with this fix in place, those values will be ignored.

Comment 8 Ben England 2018-09-06 18:45:56 UTC
Hi Brad,

This would be really, really important to get working in RHCS 3 for large clusters; I'm having a similar problem with it.  This was RHOSP (OpenStack) 13 GA, which is an LTS release and therefore widely used.  If you get the fix into RHCS 3, it should then make its way into RHOSP 13 via the Ceph container images.

What I saw is that I would enable the balancer module and try to run it, and it would no longer be enabled.

[root@overcloud-controller-2 ~]# ceph mgr module enable balancer
[root@overcloud-controller-2 ~]# ceph balancer eval
Error EINVAL: No handler found for 'balancer eval'
[root@overcloud-controller-2 ~]# ceph mgr module ls
{
    "enabled_modules": [],
    "disabled_modules": [
        "balancer",
        "dashboard",
        "influx",
        "localpool",
        "prometheus",
        "restful",
        "selftest",
        "status",
        "zabbix"
    ]
}

[root@overcloud-controller-2 ~]# rpm -qa | grep ceph
...
ceph-common-12.2.4-10.el7cp.x86_64

For a large cluster, the regular PG distribution across OSDs can lead to very inefficient operation, where a couple of OSDs run with 20-30% more load and slow down the entire cluster just because they have more PGs than the others.  To some extent this can be ameliorated by "ceph osd reweight-by-utilization", but I was looking forward to having this tool to deal with it, particularly in upmap mode.

Comment 9 Brad Hubbard 2018-09-06 23:12:10 UTC
(In reply to Ben England from comment #8)
> Hi Brad,
> 
> This would be really, really important to get working in RHCS 3 for large
> clusters, I'm having a similar problem with it.  This was RHOSP (OpenStack)
> 13 GA, which is a LTS and therefore widely used.  If you get the fix into
> RHCS 3 it should then make its way into RHOSP 13 via the Ceph container
> images.
> 
> What I saw is that I would enable the balancer module, try to run it and it
> would no longer be enabled.  
> 
> [root@overcloud-controller-2 ~]# ceph mgr module enable balancer
> [root@overcloud-controller-2 ~]# ceph balancer eval
> Error EINVAL: No handler found for 'balancer eval'
> [root@overcloud-controller-2 ~]# ceph mgr module ls
> {
>     "enabled_modules": [],
>     "disabled_modules": [
>         "balancer",
>         "dashboard",
>         "influx",
>         "localpool",
>         "prometheus",
>         "restful",
>         "selftest",
>         "status",
>         "zabbix"
>     ]
> }
> 
> [root@overcloud-controller-2 ~]# rpm -qa | grep ceph
> ...
> ceph-common-12.2.4-10.el7cp.x86_64
> 
> For a large cluster, the regular PG distribution across OSDs can lead to
> very inefficient operation, where a couple of OSDs are running with 20-30%
> more load and slowing down the entire cluster just because they have more
> PGs than everyone else.  To some extent this can be ameliorated by "ceph osd
> reweight-by-utilization", but I was looking forward to having this tool to
> deal with it, particularly in upmap mode.

Hi Ben,

https://bugzilla.redhat.com/show_bug.cgi?id=1612623 is the actual issue, this segfault won't occur if that is resolved. Perhaps an adjustment of priority/severity of that bug is in order?

Comment 20 errata-xmlrpc 2019-04-30 15:56:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911

