Description of problem:
Seeing a mgr crash on the latest Ceph 4.1 cluster.

Version-Release number of selected component (if applicable):
[root@magna057 ubuntu]# ceph -v
ceph version 14.2.8-28.el7cp (b32eac9fd60c00c62a2d3c85d88b483be7b55ba1) nautilus (stable)

How reproducible:

Steps to Reproduce:
1. Performed a rolling upgrade from a 4.0 to a 4.1 ceph cluster
2. Health status was OK after deployment
3. Ran IOs on OSDs for 30 min
4. System was idle for a day; later the mgr crash shows up in ceph -s

Actual results:
The crash is noticed on the mgr node, i.e. magna057 (root/q)

Expected results:
No crash should be seen.

Ceph status:
[root@magna057 ubuntu]# ceph -s
  cluster:
    id:     4828eef5-70fe-4bab-a5c8-5d3ec27f97ed
    health: HEALTH_WARN
            1 pool(s) have non-power-of-two pg_num
            application not enabled on 1 pool(s)
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum magna057,magna067,magna120 (age 64m)
    mgr: magna057(active, since 13h), standbys: magna120
    osd: 9 osds: 9 up (since 100m), 9 in (since 100m)
    rgw: 2 daemons active (magna057.rgw0, magna067.rgw0)

  task status:

  data:
    pools:   5 pools, 132 pgs
    objects: 116.87k objects, 458 GiB
    usage:   1.3 TiB used, 6.8 TiB / 8.2 TiB avail
    pgs:     132 active+clean

Additional info:

Error snippets:
     0> 2020-04-17 02:45:39.029 7fbf5577c700 -1 *** Caught signal (Aborted) **
 in thread 7fbf5577c700 thread_name:ms_dispatch

 ceph version 14.2.8-28.el7cp (b32eac9fd60c00c62a2d3c85d88b483be7b55ba1) nautilus (stable)
 1: (()+0xf630) [0x7fbf5ead4630]
 2: (gsignal()+0x37) [0x7fbf5d6b2387]
 3: (abort()+0x148) [0x7fbf5d6b3a78]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fbf5dfc27d5]
 5: (()+0x5e746) [0x7fbf5dfc0746]
 6: (()+0x5e773) [0x7fbf5dfc0773]
 7: (()+0x5e993) [0x7fbf5dfc0993]
 8: (()+0x1772b) [0x7fbf605dc72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7fbf605faa03]
 10: (void std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > >::_M_realloc_insert<std::pair<Context*, int> >(__gnu_cxx::__normal_iterator<std::pair<Context*, int>*, std::vector<std::pair<Context*, int>, std::allocator<std::pair<Context*, int> > > >, std::pair<Context*, int>&&)+0x56) [0x5625e6034976]
 11: (Finisher::queue(Context*, int)+0xe7) [0x5625e6034b77]
 12: (ActivePyModules::notify_all(LogEntry const&)+0x109) [0x5625e602a839]
 13: (Mgr::handle_log(MLog*)+0x58) [0x5625e60a1928]
 14: (Mgr::ms_dispatch(Message*)+0x22b) [0x5625e60a889b]
 15: (MgrStandby::ms_dispatch(Message*)+0x97) [0x5625e60b0157]
 16: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x5625e609a916]
 17: (DispatchQueue::entry()+0x1699) [0x7fbf61400c29]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fbf614ae0ad]
 19: (()+0x7ea5) [0x7fbf5eaccea5]
 20: (clone()+0x6d) [0x7fbf5d77a8dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore

Configuration details (root/q):

[grafana-server]
magna122

[mons]
magna120
magna067
magna057

[osds]
magna025 osd_scenario="lvm" osd_objectstore="bluestore" lvm_volumes="[{'data': '/dev/sdb'},{'data': '/dev/sdc'},{'data': '/dev/sdd'}]"
magna023 osd_scenario="lvm" osd_objectstore="bluestore" lvm_volumes="[{'data': '/dev/sdb'},{'data': '/dev/sdc'},{'data': '/dev/sdd'}]"
magna021 osd_scenario="lvm" osd_objectstore="bluestore" lvm_volumes="[{'data': '/dev/sdb'},{'data': '/dev/sdc'},{'data': '/dev/sdd'}]"

[mgrs]
magna120
magna057

[rgws]
magna067
magna057

Logs can be seen on the active mgr node, magna057, in /var/log/ceph -> ceph-mgr.magna057.log-20200417
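Since the report points at the log file on the active mgr node, the signal line and the backtrace after it can be pulled out with a grep with trailing context. This is a generic sketch; the helper name and the 30-line context window are assumptions for illustration, not from the report:

```shell
# Hypothetical helper: extract the "Caught signal" line plus the backtrace
# that follows it from a ceph-mgr log. Widen the 30-line window if frames
# are cut off.
crash_context() {
    grep -A 30 '\*\*\* Caught signal' "$1"
}
```

For example: crash_context /var/log/ceph/ceph-mgr.magna057.log-20200417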
That crash indicates the mgr ran out of memory. Looking at the host it's on, magna120, it appears the monitor there is using excessive memory (almost 7 GiB), but it is all in tcmalloc's freelist (out of the monitor's control). 'heap release' fixes it:

[root@magna120 ~]# ceph tell mon.magna120 heap stats
mon.magna120 tcmalloc heap stats:------------------------------------------------
MALLOC:      563092728 (  537.0 MiB) Bytes in use by application
MALLOC: +   6616170496 ( 6309.7 MiB) Bytes in page heap freelist
MALLOC: +     85302688 (   81.4 MiB) Bytes in central cache freelist
MALLOC: +      8931584 (    8.5 MiB) Bytes in transfer cache freelist
MALLOC: +     56794216 (   54.2 MiB) Bytes in thread cache freelists
MALLOC: +     43253760 (   41.2 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   7373545472 ( 7032.0 MiB) Actual memory used (physical + swap)
MALLOC: +     15990784 (   15.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   7389536256 ( 7047.2 MiB) Virtual address space used
MALLOC:
MALLOC:          28394 Spans in use
MALLOC:             24 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

[root@magna120 ~]# ceph tell mon.magna120 heap release
mon.magna120 releasing free RAM back to system.
[root@magna120 ~]# ceph tell mon.magna120 heap stats
mon.magna120 tcmalloc heap stats:------------------------------------------------
MALLOC:      563129616 (  537.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     85320096 (   81.4 MiB) Bytes in central cache freelist
MALLOC: +      8558336 (    8.2 MiB) Bytes in transfer cache freelist
MALLOC: +     57334352 (   54.7 MiB) Bytes in thread cache freelists
MALLOC: +     43253760 (   41.2 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    757596160 (  722.5 MiB) Actual memory used (physical + swap)
MALLOC: +   6631940096 ( 6324.7 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   7389536256 ( 7047.2 MiB) Virtual address space used
MALLOC:
MALLOC:          28362 Spans in use
MALLOC:             24 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Mark, have you seen this kind of behavior with tcmalloc before?
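The before/after stats above suggest automating the check done here by hand: parse the "Bytes in page heap freelist" line out of the heap stats output and only issue a heap release when it is large. A minimal sketch, assuming the tcmalloc output format shown above; the helper names and the 1 GiB threshold are made up for illustration, not a Ceph default:

```shell
# Hypothetical helpers around `ceph tell <daemon> heap stats` output.
# Reads heap stats text on stdin; prints the page-heap freelist size in bytes.
freelist_bytes() {
    awk '/Bytes in page heap freelist/ { print $3; exit }'
}

# Succeeds (exit 0) when the freelist exceeds the threshold, meaning a
# `ceph tell <daemon> heap release` is worth issuing. 1 GiB is arbitrary.
needs_release() {
    local bytes threshold=$((1024 * 1024 * 1024))
    bytes=$(freelist_bytes)
    [ "${bytes:-0}" -gt "$threshold" ]
}
```

Usage would look like: ceph tell mon.magna120 heap stats | needs_release && ceph tell mon.magna120 heap release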
@Josh, I also filed this tracker issue (https://tracker.ceph.com/issues/45136) for this same issue, as reported by Preethi. If you have a look at the picture there, between Apr 16 20:30 and Apr 17 04:45 CEST (18:30-02:45 UTC in the logs), there's a ramp-up in used memory (excluding cache/buffers/slab), and the mgr logs show 6 warnings about tcmalloc large allocations of 2 GB.
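For reference, those warnings can be counted straight from the log file rather than by eye. A sketch assuming tcmalloc's usual "tcmalloc: large alloc NNN bytes" message wording (the exact text is from tcmalloc in general, not quoted from this cluster's logs):

```shell
# Hypothetical helper: count tcmalloc large-allocation warnings in a log file.
count_large_allocs() {
    grep -c 'tcmalloc: large alloc' "$1"
}
```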
(In reply to Ernesto Puerta from comment #2)
> @Josh, I also filed this tracker issue
> (https://tracker.ceph.com/issues/45136) for this same issue, as reported by
> Preethi. If you have a look at the picture there, between Apr 16 20:30 and
> Apr 17 04.45 CEST (in the logs: 18.30-02.45 UTC), there's a ramp-up on used
> memory (excluding cache/buffers/slab) and the mgr logs show 6 warnings about
> tcmalloc large allocations of 2GB.

Thanks Ernesto, that shows the mgr process was definitely the culprit. Unfortunately abrt deleted the coredump, so we can't see what was using that memory within the process. I fixed the abrt settings on the machine per https://access.redhat.com/solutions/168603 and increased the abrt storage limit to 100G so this won't happen in the future.

Was anything in particular being tested around this time? Any dashboard functionality, or CLI commands that could have impacted the mgr?
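The abrt adjustment mentioned here can be scripted for the other nodes. A sketch assuming the cap is the MaxCrashReportsSize setting (in MiB) in an abrt.conf-style file; that setting name is an assumption about what the KB article changes, not a quote from it:

```shell
# Hypothetical helper: raise the crash-dump storage cap in an abrt-style
# config file. Args: config path, new limit in MiB
# (102400 MiB ~= 100 GiB, matching the limit mentioned above).
set_abrt_limit() {
    sed -i "s/^MaxCrashReportsSize[[:space:]]*=.*/MaxCrashReportsSize = $2/" "$1"
}
```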
Let's see if this happens again; we'll at least be able to get a coredump this time. If you see the mgr memory usage growing before it crashes, it'd be helpful to increase its debug mgr log level.
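To catch the growth in time, the mgr's resident memory can be polled on the node and the log level raised by hand once it starts ramping. A small sketch; the function name is made up, and bumping the level (e.g. via ceph config set mgr debug_mgr 20) is left as a manual step:

```shell
# Hypothetical helper: total resident memory (KiB) of all ceph-mgr processes
# on this host; prints 0 when none are running.
mgr_rss_kib() {
    ps -C ceph-mgr -o rss= | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Run from a loop or cron and compare successive samples; a steady ramp like the one in the tracker graph is the cue to raise the debug level.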
Hi Josh,

It's been a long time since I've seen the mons go crazy with memory usage; previously it was due to rocksdb compaction not being able to keep up with ingest. On the OSD side we are now automatically releasing memory periodically in the PriorityCacheManager. I believe we are using that for the mon now as well, right?
(In reply to Josh Durgin from comment #5)
> Let's see if this happens again, we'll be able to get a coredump this time
> at least. If you see the mgr memory usage growing before it crashes, it'd be
> helpful to increase its debug mgr log level.

Whose task is it? NEEDINFO on reporter?
(In reply to Yaniv Kaul from comment #7)
> (In reply to Josh Durgin from comment #5)
> > Let's see if this happens again, we'll be able to get a coredump this time
> > at least. If you see the mgr memory usage growing before it crashes, it'd be
> > helpful to increase its debug mgr log level.
>
> Who's task is it? NEEDINFO on reporter?

Yes, reproducing the problem is a task for QE. If we can reproduce, I agree it is a blocker. If we cannot, we won't have enough info to investigate and should move this to the 5.* backlog until we can reproduce.