Bug 1825312
| Summary: | [ceph-mgr] memory growth leading to oom kill | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Preethi <pnataraj> |
| Component: | RADOS | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Preethi <pnataraj> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1 | CC: | akupczyk, bhubbard, ceph-eng-bugs, dzafman, epuertat, hyelloji, jdurgin, kchai, mnelson, nojha, rzarzyns, sseshasa, tchandra |
| Target Milestone: | z1 | | |
| Target Release: | 4.1 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-10 05:17:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Preethi, 2020-04-17 16:26:35 UTC)
Josh Durgin:

That crash indicates the mgr ran out of memory. Looking at the host it's on, magna120, it appears the monitor there is using excessive memory (almost 7 GiB), but it is all in tcmalloc's freelist (out of the monitor's control). 'heap release' fixes it:

```
[root@magna120 ~]# ceph tell mon.magna120 heap stats
mon.magna120 tcmalloc heap stats:------------------------------------------------
MALLOC:      563092728 (  537.0 MiB) Bytes in use by application
MALLOC: +   6616170496 ( 6309.7 MiB) Bytes in page heap freelist
MALLOC: +     85302688 (   81.4 MiB) Bytes in central cache freelist
MALLOC: +      8931584 (    8.5 MiB) Bytes in transfer cache freelist
MALLOC: +     56794216 (   54.2 MiB) Bytes in thread cache freelists
MALLOC: +     43253760 (   41.2 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   7373545472 ( 7032.0 MiB) Actual memory used (physical + swap)
MALLOC: +     15990784 (   15.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   7389536256 ( 7047.2 MiB) Virtual address space used
MALLOC:
MALLOC:          28394 Spans in use
MALLOC:             24 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

[root@magna120 ~]# ceph tell mon.magna120 heap release
mon.magna120 releasing free RAM back to system.

[root@magna120 ~]# ceph tell mon.magna120 heap stats
mon.magna120 tcmalloc heap stats:------------------------------------------------
MALLOC:      563129616 (  537.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     85320096 (   81.4 MiB) Bytes in central cache freelist
MALLOC: +      8558336 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +     57334352 (   54.7 MiB) Bytes in thread cache freelists
MALLOC: +     43253760 (   41.2 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    757596160 (  722.5 MiB) Actual memory used (physical + swap)
MALLOC: +   6631940096 ( 6324.7 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   7389536256 ( 7047.2 MiB) Virtual address space used
MALLOC:
MALLOC:          28362 Spans in use
MALLOC:             24 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
```

Mark, have you seen this kind of behavior with tcmalloc before?
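For reference, a minimal shell sketch of the check-and-release workflow shown above, suitable for a cron job. The daemon name, the 4096 MiB threshold, and the stream handling are assumptions for illustration, not settings taken from this cluster.

```bash
#!/bin/bash
# Hypothetical watchdog: ask a Ceph daemon for its tcmalloc heap stats
# and trigger "heap release" when the page heap freelist grows too large.
DAEMON="mon.magna120"   # assumed name; could equally be mgr.<host>
THRESHOLD_MIB=4096      # illustrative threshold, tune to the host

# Parse the "Bytes in page heap freelist" line ($3 is the byte count);
# some releases print the tell output on stderr, so capture both streams.
freelist_mib=$(ceph tell "$DAEMON" heap stats 2>&1 |
    awk '/Bytes in page heap freelist/ {print int($3 / 1048576)}')

if [ "${freelist_mib:-0}" -ge "$THRESHOLD_MIB" ]; then
    echo "$DAEMON page heap freelist at ${freelist_mib} MiB, releasing"
    ceph tell "$DAEMON" heap release
fi
```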
Ernesto Puerta (comment #2):

@Josh, I also filed this tracker issue (https://tracker.ceph.com/issues/45136) for this same issue, as reported by Preethi. If you have a look at the picture there, between Apr 16 20:30 and Apr 17 04:45 CEST (in the logs: 18:30-02:45 UTC), there's a ramp-up in used memory (excluding cache/buffers/slab) and the mgr logs show 6 warnings about tcmalloc large allocations of 2 GB.

Josh Durgin:

(In reply to Ernesto Puerta from comment #2)
> @Josh, I also filed this tracker issue
> (https://tracker.ceph.com/issues/45136) for this same issue, as reported by
> Preethi.

Thanks Ernesto, that shows the mgr process was definitely the culprit. Unfortunately abrt deleted the coredump, so we can't see what was using that memory within the process. I fixed the abrt settings on the machine per https://access.redhat.com/solutions/168603 and increased the abrt storage limit to 100G so this won't happen in the future. Was anything in particular being tested around this time? Any dashboard functionality, or CLI commands that could have impacted the mgr?

Josh Durgin (comment #5):

Let's see if this happens again; we'll be able to get a coredump this time at least. If you see the mgr memory usage growing before it crashes, it'd be helpful to increase its debug mgr log level.

Mark Nelson:

Hi Josh, it's been a long time since I've seen the mons go crazy with memory usage, and previously it was due to rocksdb compaction not being able to keep up with ingest. On the OSD side we are now automatically releasing memory periodically in the PriorityCacheManager. I believe we are using that for the mon now as well, right?

Yaniv Kaul (comment #7):

(In reply to Josh Durgin from comment #5)
> Let's see if this happens again; we'll be able to get a coredump this time
> at least. If you see the mgr memory usage growing before it crashes, it'd be
> helpful to increase its debug mgr log level.

Whose task is it? NEEDINFO on reporter?

Josh Durgin:

(In reply to Yaniv Kaul from comment #7)
> Whose task is it? NEEDINFO on reporter?

Yes, reproducing the problem is a task for QE. If we can reproduce, I agree it is a blocker. If we cannot, we won't have enough info to investigate and should move this to the 5.* backlog until we can reproduce.
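As a hedged illustration of the abrt change Josh describes: on RHEL 7 the crash-dump retention cap lives in /etc/abrt/abrt.conf, and raising it to roughly 100G could look like the sketch below. The sed one-liner and the exact value are assumptions; the linked KB article is the authoritative procedure.

```bash
# MaxCrashReportsSize caps abrt's dump storage, in MiB (0 = unlimited).
# 102400 MiB is roughly the 100G mentioned in the comment above.
sudo sed -i 's/^#\?MaxCrashReportsSize.*/MaxCrashReportsSize = 102400/' /etc/abrt/abrt.conf

# Restart the daemon so the new limit takes effect.
sudo systemctl restart abrtd
```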
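One way to act on Josh's suggestion in comment #5 is sketched below: raise the mgr debug level through the centralized config and log the mgr's RSS periodically so the growth is visible before a crash. The debug level of 20, the poll interval, and the process name are illustrative assumptions.

```bash
# Verbose mgr logging (level 20 assumed as "maximum useful detail").
ceph config set mgr debug_mgr 20

# Poll the local ceph-mgr RSS once a minute; pidof -s (single pid)
# assumes the default process name on this host.
while sleep 60; do
    pid=$(pidof -s ceph-mgr) || continue
    echo "$(date -u +%FT%TZ) rss_kb=$(awk '/VmRSS/ {print $2}' /proc/"$pid"/status)"
done
```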
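On Mark's question about the PriorityCacheManager: newer Nautilus-based releases (the base of 4.x) add mon cache auto-tuning against a memory target, analogous to osd_memory_target on the OSD side, though it is worth verifying the option exists in the running version. A sketch, assuming mon_memory_target is available; the 2 GiB value is just the upstream default, shown for illustration:

```bash
# Inspect the current targets (option names assume a Nautilus-era release).
ceph config get mon mon_memory_target
ceph config get osd osd_memory_target

# Pin the mon's target explicitly, here to the 2 GiB default (in bytes).
ceph config set mon mon_memory_target 2147483648
```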