Bug 2021931 - [Ceph Tracker bug #2185532] [DR] OSD crash with OOM when removing data
Summary: [Ceph Tracker bug #2185532] [DR] OSD crash with OOM when removing data
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 2069753
Depends On: 2185532
Blocks:
 
Reported: 2021-11-10 13:01 UTC by Elvir Kuric
Modified: 2023-08-14 07:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2185532
Environment:
Last Closed: 2023-08-14 07:57:15 UTC
Embargoed:



Comment 4 Mudit Agarwal 2021-11-15 08:18:09 UTC
Not a proposed 4.9 blocker, moving it out.

Scott, can someone please take a look?

Comment 5 Scott Ostapovicz 2021-11-15 14:23:34 UTC
Travis, please take a quick look.

Comment 6 Travis Nielsen 2021-11-15 18:53:32 UTC
In the OSD logs [1] I see lots of messages about the write buffer being full. 

debug 2021-11-10T12:16:47.789+0000 7f5f50da6700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1636546607790955, "job": 34, "event": "flush_started", "num_memtables": 1, "num_entries": 5325, "num_deletes": 1767, "total_data_size": 259268310, "memory_usage": 260140856, "flush_reason": "Write Buffer Full"}

Looks like a question for core ceph around the deletions handling.
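
To count or summarize these flush events across a whole OSD log, the JSON payload after the EVENT_LOG_v1 marker can be parsed directly. The following is only a sketch: the helper name parse_rocksdb_event is hypothetical, and it assumes each event sits on a single log line, as in the sample from this comment.

```python
import json

def parse_rocksdb_event(line):
    """Return the JSON payload of a rocksdb EVENT_LOG_v1 line, or None."""
    marker = "EVENT_LOG_v1 "
    idx = line.find(marker)
    if idx == -1:
        return None  # not a rocksdb event line
    return json.loads(line[idx + len(marker):])

# Sample line from this comment, joined onto one logical line.
sample = (
    'debug 2021-11-10T12:16:47.789+0000 7f5f50da6700  4 rocksdb: EVENT_LOG_v1 '
    '{"time_micros": 1636546607790955, "job": 34, "event": "flush_started", '
    '"num_memtables": 1, "num_entries": 5325, "num_deletes": 1767, '
    '"total_data_size": 259268310, "memory_usage": 260140856, '
    '"flush_reason": "Write Buffer Full"}'
)

event = parse_rocksdb_event(sample)
# Share of memtable entries that are deletes (tombstones).
delete_ratio = event["num_deletes"] / event["num_entries"]
summary = "{} ({}): {:.1f} MiB in memtable, {:.0%} deletes".format(
    event["event"],
    event["flush_reason"],
    event["memory_usage"] / 2**20,
    delete_ratio,
)
print(summary)
```

Run over every line of the log, this would show how often the flush reason is "Write Buffer Full" and how large the delete share is per flush.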

Comment 10 Mark Nelson 2021-11-16 17:55:05 UTC
Hi Elvir,

Can you dump the mempools from one of the OSDs that is using more memory than the osd_memory_target? There's a recent bug we saw on the mailing list where the pglog length grew excessively:

https://www.spinics.net/lists/ceph-users/msg69599.html



If the mempools don't tell us anything useful, it may also be worth enabling debug bluestore = 5 and debug prioritycache = 5 to make sure the cache autotuning is functioning properly (it always has been in the past though).

Comment 11 Elvir Kuric 2021-11-17 08:04:08 UTC
ceph crash ls
ID                                                                ENTITY  NEW  
2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f  osd.2    *   
2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274  osd.5    *   
2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2  osd.5    *   
2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872  osd.2    *   
2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0  mon.a    *   
sh-4.4$ for z in $(ceph crash ls |grep -v ID  |awk '{print $1}'); do echo "Crash ------- $z ---------- Crash" ;  ceph crash info $z; done 
Crash ------- 2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f ---------- Crash
{
    "assert_condition": "abort",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
    "assert_line": 13122,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7ff47866a700 time 2021-11-15T19:45:16.692129+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7ff499762b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x56530fac64d1]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x56531010bde7]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x56531010dc96]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x56530fc31305]",
        "(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x56530fd6a643]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x56530ff656e2]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x56530ff675a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x56530fd9d802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x56530fd409be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x56530fbca549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x56530fe26fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x56530fbea508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x565310251934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5653102545d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7ff49975814a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f",
    "entity_name": "osd.2",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
    "timestamp": "2021-11-15T19:45:16.701285Z",
    "utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274 ---------- Crash
{
    "assert_condition": "abort",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
    "assert_line": 13122,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f9ae92e8700 time 2021-11-15T23:40:59.452631+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f9b103ecb20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556e17dd54d1]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x556e1841ade7]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x556e1841cc96]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x556e17f40305]",
        "(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x556e18079643]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x556e182746e2]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x556e182765a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x556e180ac802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x556e1804f9be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x556e17ed9549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556e18135fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x556e17ef9508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x556e18560934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556e185635d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f9b103e214a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274",
    "entity_name": "osd.5",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
    "timestamp": "2021-11-15T23:40:59.464430Z",
    "utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2 ---------- Crash
{
    "assert_condition": "is_primary()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc",
    "assert_func": "virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)",
    "assert_line": 413,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f3f29ccf700 time 2021-11-16T02:24:18.099792+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: 413: FAILED ceph_assert(is_primary())\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f3f4f5d0b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x558b11833d11]",
        "ceph-osd(+0x568eda) [0x558b11833eda]",
        "(PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x332) [0x558b11a2d5a2]",
        "(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3bd) [0x558b11cd307d]",
        "(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x24b) [0x558b11cd34bb]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x558b11cd55a8]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x558b11b0b802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x558b11aae9be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x558b11938549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x558b11b94fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x558b11958508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x558b11fbf934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x558b11fc25d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f3f4f5c614a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2",
    "entity_name": "osd.5",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "73067413339673026a198490b3316de6c6afc1e0280a3d04181ed63198708efb",
    "timestamp": "2021-11-16T02:24:18.108029Z",
    "utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872 ---------- Crash
{
    "assert_condition": "p->second.need <= v || p->second.is_delete()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h",
    "assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]",
    "assert_line": 4910,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fbdd920e700 time 2021-11-16T05:27:28.471737+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: 4910: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fbdfe30eb20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x556252216d11]",
        "ceph-osd(+0x568eda) [0x556252216eda]",
        "(PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1b4) [0x556252591264]",
        "(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x585) [0x5562526b0d35]",
        "(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0x101) [0x5562526b39c1]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x5562526b8537]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5562524ee802]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5562524919be]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55625231b549]",
        "(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556252577fa8]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55625233b508]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5562529a2934]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5562529a55d4]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fbdfe30414a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872",
    "entity_name": "osd.2",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "85753395395b5d23971763812d6c1b4ed9a1fdbb2c56a90ae31a8ecf61472de5",
    "timestamp": "2021-11-16T05:27:28.550505Z",
    "utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0 ---------- Crash
{
    "assert_condition": "(sharded_in_flight_list.back())->ops_in_flight_sharded.empty()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc",
    "assert_func": "OpTracker::~OpTracker()",
    "assert_line": 173,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: In function 'OpTracker::~OpTracker()' thread 7f84ab877700 time 2021-11-16T09:26:47.045591+0000\n/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: 173: FAILED ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())\n",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f84a0508b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f84a2a0a5f1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x2767ba) [0x7f84a2a0a7ba]",
        "(OpTracker::~OpTracker()+0x39) [0x7f84a2affa99]",
        "(Monitor::~Monitor()+0xac) [0x5612e47be6cc]",
        "(Monitor::~Monitor()+0xd) [0x5612e47bf1ad]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "16.2.0-143.el8cp",
    "crash_id": "2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0",
    "entity_name": "mon.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-mon",
    "stack_sig": "76e590fa2428cd05276f0da5e7a237b294c18401786d06e48f0176f55d44ba60",
    "timestamp": "2021-11-16T09:26:47.048294Z",
    "utsname_hostname": "rook-ceph-mon-a-547c944c66-ghmqh",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
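
The five reports above can be grouped by their stack_sig field to see which crashes share a backtrace. The sketch below hard-codes a few fields copied from the output above; the stack_sig values are shortened to their first 8 hex digits, the assert_func names are abbreviated, and the helper name group_by_signature is hypothetical.

```python
from collections import defaultdict

# Fields copied from the crash reports above. stack_sig is truncated to
# 8 hex digits and assert_func is abbreviated for readability.
crashes = [
    {"entity_name": "osd.2", "stack_sig": "fb66b70c",
     "assert_func": "BlueStore::_txc_add_transaction"},
    {"entity_name": "osd.5", "stack_sig": "fb66b70c",
     "assert_func": "BlueStore::_txc_add_transaction"},
    {"entity_name": "osd.5", "stack_sig": "73067413",
     "assert_func": "PrimaryLogPG::on_local_recover"},
    {"entity_name": "osd.2", "stack_sig": "85753395",
     "assert_func": "pg_missing_set::got"},
    {"entity_name": "mon.a", "stack_sig": "76e590fa",
     "assert_func": "OpTracker::~OpTracker"},
]

def group_by_signature(reports):
    """Map each stack_sig to the list of daemons that reported it."""
    groups = defaultdict(list)
    for report in reports:
        groups[report["stack_sig"]].append(report["entity_name"])
    return dict(groups)

groups = group_by_signature(crashes)
for sig, entities in groups.items():
    print(sig, ", ".join(entities))
```

With this input, the two _txc_add_transaction aborts on osd.2 and osd.5 fall into one group, while the remaining three reports each have a distinct signature.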

Comment 12 Scott Ostapovicz 2021-11-17 15:26:54 UTC
@mnelson would you please confirm that it is (or is not) a pglog problem.

Comment 13 Elvir Kuric 2021-11-17 15:30:54 UTC
Some OSDs crashed. As soon as I was able to run "oc rsh osd-pod" on them, I got the output below for ceph daemon ... dump_mempools


---- 

ceph daemon osd.3 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5158,
                "bytes": 412640
            },
            "bluestore_cache_data": {
                "items": 1339,
                "bytes": 62044609
            },
            "bluestore_cache_onode": {
                "items": 2069,
                "bytes": 1274504
            },
            "bluestore_cache_meta": {
                "items": 176556,
                "bytes": 1610916
            },
            "bluestore_cache_other": {
                "items": 122512,
                "bytes": 5575784
            },
            "bluestore_Buffer": {
                "items": 1318,
                "bytes": 126528
            },
            "bluestore_Extent": {
                "items": 35830,
                "bytes": 1719840
            },
            "bluestore_Blob": {
                "items": 35774,
                "bytes": 3720496
            },
            "bluestore_SharedBlob": {
                "items": 25468,
                "bytes": 2852416
            },
            "bluestore_inline_bl": {
                "items": 957,
                "bytes": 325454
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 68,
                "bytes": 14086280
            },
            "bluefs": {
                "items": 1341,
                "bytes": 42704
            },
            "bluefs_file_reader": {
                "items": 166,
                "bytes": 596352
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 576
            },
            "buffer_anon": {
                "items": 4707,
                "bytes": 257878260
            },
            "buffer_meta": {
                "items": 2897,
                "bytes": 254936
            },
            "osd": {
                "items": 90,
                "bytes": 1018080
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 126766,
                "bytes": 65270832
            },
            "osdmap": {
                "items": 2034608,
                "bytes": 32809328
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 2577627,
            "bytes": 451620535
        }
    }
}




---- 

 unset CEPH_ARGS
sh-4.4# ceph daemon osd.6 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5836,
                "bytes": 466880
            },
            "bluestore_cache_data": {
                "items": 271,
                "bytes": 4891083
            },
            "bluestore_cache_onode": {
                "items": 422,
                "bytes": 259952
            },
            "bluestore_cache_meta": {
                "items": 24743,
                "bytes": 178451
            },
            "bluestore_cache_other": {
                "items": 9739,
                "bytes": 430224
            },
            "bluestore_Buffer": {
                "items": 133,
                "bytes": 12768
            },
            "bluestore_Extent": {
                "items": 2934,
                "bytes": 140832
            },
            "bluestore_Blob": {
                "items": 2934,
                "bytes": 305136
            },
            "bluestore_SharedBlob": {
                "items": 2475,
                "bytes": 277200
            },
            "bluestore_inline_bl": {
                "items": 192,
                "bytes": 94434
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 1,
                "bytes": 784
            },
            "bluestore_writing_deferred": {
                "items": 4,
                "bytes": 20096
            },
            "bluestore_writing": {
                "items": 57,
                "bytes": 21594347
            },
            "bluefs": {
                "items": 757,
                "bytes": 18360
            },
            "bluefs_file_reader": {
                "items": 47,
                "bytes": 593024
            },
            "bluefs_file_writer": {
                "items": 3,
                "bytes": 576
            },
            "buffer_anon": {
                "items": 324,
                "bytes": 10723644
            },
            "buffer_meta": {
                "items": 212,
                "bytes": 18656
            },
            "osd": {
                "items": 98,
                "bytes": 1063584
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 143921,
                "bytes": 74269128
            },
            "osdmap": {
                "items": 2041877,
                "bytes": 32926004
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 2236980,
            "bytes": 148285163
        }
    }
}
sh-4.4#
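
To compare dumps like the two above, it can help to rank the pools by size. The following is a sketch only: top_pools is a hypothetical helper, and the sample keeps just the five largest osd.3 pools from the output above together with the reported total. Mempools account for only part of a daemon's memory, so these totals should not be expected to match the process RSS.

```python
def top_pools(dump, n=3):
    """Return the n largest mempools as (name, bytes, share_of_total)."""
    pools = dump["mempool"]["by_pool"]
    total = dump["mempool"]["total"]["bytes"]
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, stats["bytes"], stats["bytes"] / total)
            for name, stats in ranked[:n]]

# Truncated copy of the osd.3 dump above: only the five largest pools are
# kept, and "total" is the value reported by the daemon for all pools.
osd3 = {
    "mempool": {
        "by_pool": {
            "bluestore_cache_data": {"items": 1339, "bytes": 62044609},
            "bluestore_writing": {"items": 68, "bytes": 14086280},
            "buffer_anon": {"items": 4707, "bytes": 257878260},
            "osd_pglog": {"items": 126766, "bytes": 65270832},
            "osdmap": {"items": 2034608, "bytes": 32809328},
        },
        "total": {"items": 2577627, "bytes": 451620535},
    }
}

top = top_pools(osd3, n=3)
for name, nbytes, share in top:
    print("{:<22} {:>8.1f} MiB  {:.0%}".format(name, nbytes / 2**20, share))
```

The same function can be applied to the full JSON from each OSD (for example after loading it with json.load) to see which pool dominates on a daemon that exceeds its memory target.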

Comment 16 Mudit Agarwal 2022-01-11 11:34:47 UTC
Removing 4.9.z, not a TP blocker. This has existed for a long time and still needs investigation and QE input.

Comment 18 Mark Nelson 2022-02-16 13:03:13 UTC
@ekuric Ok, looking at those results it doesn't appear that the WAL buffers in rocksdb are backing up imho.  Josh Durgin mentioned that given we are seeing this with RBD mirroring, it may be related to snapshot trimming.

I went back and looked through an old thread on the ceph-users mailing list. Frank Schilder also saw memory growth seemingly due to snap trimming:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/

He gathered tcmalloc heap profiles and believes he may have observed a memory leak of decoded data:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/

It probably means we need to see if we allocate memory to decode some data structure from rocksdb related to snap trimming somewhere and never actually free it.

Comment 19 Mark Nelson 2022-02-16 16:04:11 UTC
Wrote that a little fast and can't edit, ignore the duplicate link please. :)

Comment 20 Josh Durgin 2022-02-16 16:08:40 UTC
Elvir, can you get more detailed memory allocation info as Mark suggests in comment#15?

Comment 24 Scott Ostapovicz 2022-03-30 17:47:21 UTC
*** Bug 2069753 has been marked as a duplicate of this bug. ***

Comment 26 Mudit Agarwal 2022-07-05 13:13:52 UTC
Not a 4.11 blocker

