Bug 2021931

Summary: [Ceph Tracker bug #2185532] [DR] OSD crash with OOM when removing data
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Elvir Kuric <ekuric>
Component: ceph
ceph sub component: RADOS
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
Docs Contact:
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
CC: bniver, ebenahar, jdurgin, jespy, kramdoss, kseeger, mmuench, muagarwa, nojha, odf-bz-bot, pdhange, prsurve, rsussman, shberry, sostapov, vumrao
Version: 4.9
Keywords: AutomationBackLog, Performance
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2185532 (view as bug list)
Environment:
Last Closed: 2023-08-14 07:57:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2185532
Bug Blocks:

Comment 4 Mudit Agarwal 2021-11-15 08:18:09 UTC
Not a proposed 4.9 blocker, moving it out.

Scott, can someone please take a look?

Comment 5 Scott Ostapovicz 2021-11-15 14:23:34 UTC
Travis, please take a quick look.

Comment 6 Travis Nielsen 2021-11-15 18:53:32 UTC
In the OSD logs [1] I see lots of messages about the write buffer being full. 

debug 2021-11-10T12:16:47.789+0000 7f5f50da6700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1636546607790955, "job": 34, "event": "flush_started", "num_memtables": 1, "num_entries": 5325, "num_deletes": 1767, "total_data_size": 259268310, "memory_usage": 260140856, "flush_reason": "Write Buffer Full"}

Looks like a question for core Ceph around deletion handling.

Comment 10 Mark Nelson 2021-11-16 17:55:05 UTC
Hi Elvir,

Can you dump the mempools from one of the OSDs that is using more memory than the osd_memory_target? There's a recent bug we saw on the mailing list where the pglog length grew excessively:

https://www.spinics.net/lists/ceph-users/msg69599.html

If the mempools don't tell us anything useful, it may also be worth enabling "debug bluestore = 5" and "debug prioritycache = 5" to make sure the cache autotuning is functioning properly (it always has been in the past, though).
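
A minimal sketch of gathering both pieces of data, run from inside an OSD pod (osd.2 is an example daemon id; "ceph daemon" talks to the daemon's local admin socket, while "ceph tell" can be run from any node with client access):

# Dump per-pool memory accounting; compare the total and the osd_pglog
# pool against the configured osd_memory_target.
ceph daemon osd.2 dump_mempools

# Raise debug levels so the cache autotuner logs its decisions.
ceph tell osd.2 config set debug_bluestore 5
ceph tell osd.2 config set debug_prioritycache 5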

Comment 11 Elvir Kuric 2021-11-17 08:04:08 UTC
ceph crash ls
ID                                                                ENTITY  NEW  
2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f  osd.2    *   
2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274  osd.5    *   
2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2  osd.5    *   
2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872  osd.2    *   
2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0  mon.a    *   
sh-4.4$ for z in $(ceph crash ls |grep -v ID  |awk '{print $1}'); do echo "Crash ------- $z ---------- Crash" ;  ceph crash info $z; done 
Crash ------- 2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f ---------- Crash
{
    "assert_condition": "abort",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
    "assert_line": 13122,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7ff47866a700 time 2021-11-15T19:45:16.692129+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7ff499762b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x56530fac64d1]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x56531010bde7]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x56531010dc96]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x56530fc31305]",
        "(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x56530fd6a643]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x56530ff656e2]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x56530ff675a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x56530fd9d802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x56530fd409be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x56530fbca549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x56530fe26fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x56530fbea508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x565310251934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5653102545d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7ff49975814a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f",
    "entity_name": "osd.2",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
    "timestamp": "2021-11-15T19:45:16.701285Z",
    "utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274 ---------- Crash
{
    "assert_condition": "abort",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
    "assert_line": 13122,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f9ae92e8700 time 2021-11-15T23:40:59.452631+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f9b103ecb20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556e17dd54d1]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x556e1841ade7]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x556e1841cc96]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x556e17f40305]",
        "(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x556e18079643]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x556e182746e2]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x556e182765a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x556e180ac802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x556e1804f9be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x556e17ed9549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556e18135fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x556e17ef9508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x556e18560934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556e185635d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f9b103e214a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274",
    "entity_name": "osd.5",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
    "timestamp": "2021-11-15T23:40:59.464430Z",
    "utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2 ---------- Crash
{
    "assert_condition": "is_primary()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc",
    "assert_func": "virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)",
    "assert_line": 413,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f3f29ccf700 time 2021-11-16T02:24:18.099792+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: 413: FAILED ceph_assert(is_primary())\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f3f4f5d0b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x558b11833d11]",
        "ceph-osd(+0x568eda) [0x558b11833eda]",
        "(PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x332) [0x558b11a2d5a2]",
        "(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3bd) [0x558b11cd307d]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x24b) [0x558b11cd34bb]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x558b11cd55a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x558b11b0b802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x558b11aae9be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x558b11938549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x558b11b94fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x558b11958508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x558b11fbf934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x558b11fc25d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f3f4f5c614a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2",
    "entity_name": "osd.5",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "73067413339673026a198490b3316de6c6afc1e0280a3d04181ed63198708efb",
    "timestamp": "2021-11-16T02:24:18.108029Z",
    "utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872 ---------- Crash
{
    "assert_condition": "p->second.need <= v || p->second.is_delete()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h",
    "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]",
    "assert_line": 4910,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fbdd920e700 time 2021-11-16T05:27:28.471737+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: 4910: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fbdfe30eb20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x556252216d11]",
        "ceph-osd(+0x568eda) [0x556252216eda]",
        "(PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1b4) [0x556252591264]",
        "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x585) [0x5562526b0d35]",
        "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0x101) [0x5562526b39c1]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x5562526b8537]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5562524ee802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5562524919be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55625231b549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556252577fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55625233b508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5562529a2934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5562529a55d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fbdfe30414a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872",
    "entity_name": "osd.2",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "85753395395b5d23971763812d6c1b4ed9a1fdbb2c56a90ae31a8ecf61472de5",
    "timestamp": "2021-11-16T05:27:28.550505Z",
    "utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0 ---------- Crash
{
    "assert_condition": "(sharded_in_flight_list.back())->ops_in_flight_sharded.empty()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc",
    "assert_func": "OpTracker::~OpTracker()",
    "assert_line": 173,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: In function 'OpTracker::~OpTracker()' thread 7f84ab877700 time 2021-11-16T09:26:47.045591+0000\n/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: 173: FAILED ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())\n",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f84a0508b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f84a2a0a5f1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x2767ba) [0x7f84a2a0a7ba]",
        "(OpTracker::~OpTracker()+0x39) [0x7f84a2affa99]",
        "(Monitor::~Monitor()+0xac) [0x5612e47be6cc]",
        "(Monitor::~Monitor()+0xd) [0x5612e47bf1ad]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0",
    "entity_name": "mon.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-mon",
    "stack_sig": "76e590fa2428cd05276f0da5e7a237b294c18401786d06e48f0176f55d44ba60",
    "timestamp": "2021-11-16T09:26:47.048294Z",
    "utsname_hostname": "rook-ceph-mon-a-547c944c66-ghmqh",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}

Comment 12 Scott Ostapovicz 2021-11-17 15:26:54 UTC
@mnelson, would you please confirm that it is (or is not) a pglog problem?

Comment 13 Elvir Kuric 2021-11-17 15:30:54 UTC
Some OSDs crashed. As soon as I was able to exec into them ("oc rsh osd-pod"), I got the output below from "ceph daemon ... dump_mempools".

----

ceph daemon osd.3 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5158,
                "bytes": 412640
            },
            "bluestore_cache_data": {
                "items": 1339,
                "bytes": 62044609
            },
            "bluestore_cache_onode": {
                "items": 2069,
                "bytes": 1274504
            },
            "bluestore_cache_meta": {
                "items": 176556,
                "bytes": 1610916
            },
            "bluestore_cache_other": {
                "items": 122512,
                "bytes": 5575784
            },
            "bluestore_Buffer": {
                "items": 1318,
                "bytes": 126528
            },
            "bluestore_Extent": {
                "items": 35830,
                "bytes": 1719840
            },
            "bluestore_Blob": {
                "items": 35774,
                "bytes": 3720496
            },
            "bluestore_SharedBlob": {
                "items": 25468,
                "bytes": 2852416
            },
            "bluestore_inline_bl": {
                "items": 957,
                "bytes": 325454
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 68,
                "bytes": 14086280
            },
            "bluefs": {
                "items": 1341,
                "bytes": 42704
            },
            "bluefs_file_reader": {
                "items": 166,
                "bytes": 596352
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 576
            },
            "buffer_anon": {
                "items": 4707,
                "bytes": 257878260
            },
            "buffer_meta": {
                "items": 2897,
                "bytes": 254936
            },
            "osd": {
                "items": 90,
                "bytes": 1018080
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 126766,
                "bytes": 65270832
            },
            "osdmap": {
                "items": 2034608,
                "bytes": 32809328
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 2577627,
            "bytes": 451620535
        }
    }
}

----

unset CEPH_ARGS
sh-4.4# ceph daemon osd.6 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5836,
                "bytes": 466880
            },
            "bluestore_cache_data": {
                "items": 271,
                "bytes": 4891083
            },
            "bluestore_cache_onode": {
                "items": 422,
                "bytes": 259952
            },
            "bluestore_cache_meta": {
                "items": 24743,
                "bytes": 178451
            },
            "bluestore_cache_other": {
                "items": 9739,
                "bytes": 430224
            },
            "bluestore_Buffer": {
                "items": 133,
                "bytes": 12768
            },
            "bluestore_Extent": {
                "items": 2934,
                "bytes": 140832
            },
            "bluestore_Blob": {
                "items": 2934,
                "bytes": 305136
            },
            "bluestore_SharedBlob": {
                "items": 2475,
                "bytes": 277200
            },
            "bluestore_inline_bl": {
                "items": 192,
                "bytes": 94434
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 1,
                "bytes": 784
            },
            "bluestore_writing_deferred": {
                "items": 4,
                "bytes": 20096
            },
            "bluestore_writing": {
                "items": 57,
                "bytes": 21594347
            },
            "bluefs": {
                "items": 757,
                "bytes": 18360
            },
            "bluefs_file_reader": {
                "items": 47,
                "bytes": 593024
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 576
            },
            "buffer_anon": {
                "items": 324,
                "bytes": 10723644
            },
            "buffer_meta": {
                "items": 212,
                "bytes": 18656
            },
            "osd": {
                "items": 98,
                "bytes": 1063584
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 143921,
                "bytes": 74269128
            },
            "osdmap": {
                "items": 2041877,
                "bytes": 32926004
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 2236980,
            "bytes": 148285163
        }
    }
}
sh-4.4#

Comment 16 Mudit Agarwal 2022-01-11 11:34:47 UTC
Removing 4.9.z, not a TP blocker. This has existed for a long time and still needs investigation and QE input.

Comment 18 Mark Nelson 2022-02-16 13:03:13 UTC
@ekuric OK, looking at those results, it doesn't appear that the RocksDB WAL buffers are backing up, IMHO. Josh Durgin mentioned that, given we are seeing this with RBD mirroring, it may be related to snapshot trimming.

I went back and looked through an old thread on the ceph-users mailing list, where Frank Schilder also saw memory growth seemingly due to snap trimming:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/

In that same thread, he gathered tcmalloc heap profiles and believes he may have observed a memory leak of decoded data.

It probably means we need to see if we allocate memory somewhere to decode some data structure from RocksDB related to snap trimming and never actually free it.
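
For reference, a minimal sketch of collecting a tcmalloc heap profile like the one in that thread, using Ceph's built-in heap commands (osd.2 is an example daemon id; this assumes the OSD is running with tcmalloc, the default allocator):

# Profile heap allocations on a suspect OSD while its memory grows.
ceph tell osd.2 heap start_profiler
# ... let the delete/snap-trim workload run while RSS climbs ...
ceph tell osd.2 heap dump       # writes a profile file for pprof analysis
ceph tell osd.2 heap stats      # tcmalloc's summary of current usage
ceph tell osd.2 heap stop_profiler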

Comment 20 Josh Durgin 2022-02-16 16:08:40 UTC
Elvir, can you get more detailed memory allocation info as Mark suggests in comment #15?

Comment 24 Scott Ostapovicz 2022-03-30 17:47:21 UTC
*** Bug 2069753 has been marked as a duplicate of this bug. ***

Comment 26 Mudit Agarwal 2022-07-05 13:13:52 UTC
Not a 4.11 blocker