Bug 2021931
| Summary: | [Ceph Tracker bug #2185532] [DR] OSD crash with OOM when removing data | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Elvir Kuric <ekuric> |
| Component: | ceph | Assignee: | Neha Ojha <nojha> |
| ceph sub component: | RADOS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | bniver, ebenahar, jdurgin, jespy, kramdoss, kseeger, mmuench, muagarwa, nojha, odf-bz-bot, pdhange, prsurve, rsussman, shberry, sostapov, vumrao |
| Version: | 4.9 | Keywords: | AutomationBackLog, Performance |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Clones: | 2185532 (view as bug list) |
| Environment: | | | |
| Last Closed: | 2023-08-14 07:57:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2185532 | Bug Blocks: | |
Comment 4
Mudit Agarwal
2021-11-15 08:18:09 UTC
Travis, please take a quick look. In the OSD logs [1] I see lots of messages about the write buffer being full:
debug 2021-11-10T12:16:47.789+0000 7f5f50da6700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1636546607790955, "job": 34, "event": "flush_started", "num_memtables": 1, "num_entries": 5325, "num_deletes": 1767,
"total_data_size": 259268310, "memory_usage": 260140856, "flush_reason": "Write Buffer Full"}
This looks like a question for core Ceph around how deletions are handled.
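For reference, the write-buffer sizing those flush messages run up against, and the memory budget the autotuner works with, can be read straight from the OSD admin socket. A minimal sketch (osd.3 is an arbitrary example ID; run from inside the OSD pod or wherever its admin socket is reachable):
ceph daemon osd.3 config get bluestore_rocksdb_options   # includes write_buffer_size and max_write_buffer_number
ceph daemon osd.3 config get osd_memory_target           # per-OSD memory budget used by the cache autotuner
ceph daemon osd.3 perf dump rocksdb                      # rocksdb perf counters (commit latency, compactions)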
Hi Elvir, can you dump the mempools from one of the OSDs that is using more memory than the osd_memory_target? There's a recent bug we saw on the mailing list where the pglog length grew excessively: https://www.spinics.net/lists/ceph-users/msg69599.html
If the mempools don't tell us anything useful, it may also be worth enabling debug bluestore = 5 and debug prioritycache = 5 to make sure the cache autotuning is functioning properly (it always has been in the past, though).
ceph crash ls
ID ENTITY NEW
2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f osd.2 *
2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274 osd.5 *
2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2 osd.5 *
2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872 osd.2 *
2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0 mon.a *
sh-4.4$ for z in $(ceph crash ls |grep -v ID |awk '{print $1}'); do echo "Crash ------- $z ---------- Crash" ; ceph crash info $z; done
Crash ------- 2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f ---------- Crash
{
"assert_condition": "abort",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
"assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
"assert_line": 13122,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7ff47866a700 time 2021-11-15T19:45:16.692129+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7ff499762b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x56530fac64d1]",
"(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x56531010bde7]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x56531010dc96]",
"(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x56530fc31305]",
"(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x56530fd6a643]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x56530ff656e2]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x56530ff675a8]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x56530fd9d802]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x56530fd409be]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x56530fbca549]",
"(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x56530fe26fa8]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x56530fbea508]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x565310251934]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5653102545d4]",
"/lib64/libpthread.so.0(+0x814a) [0x7ff49975814a]",
"clone()"
],
"ceph_version": "16.2.0-143.el8cp",
"crash_id": "2021-11-15T19:45:16.701285Z_8211822e-d0b9-4f04-963e-2e239a0c843f",
"entity_name": "osd.2",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.4 (Ootpa)",
"os_version_id": "8.4",
"process_name": "ceph-osd",
"stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
"timestamp": "2021-11-15T19:45:16.701285Z",
"utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274 ---------- Crash
{
"assert_condition": "abort",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc",
"assert_func": "void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)",
"assert_line": 13122,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f9ae92e8700 time 2021-11-15T23:40:59.452631+0000\n/builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/BlueStore.cc: 13122: ceph_abort_msg(\"unexpected error\")\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f9b103ecb20]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556e17dd54d1]",
"(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1507) [0x556e1841ade7]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x556e1841cc96]",
"(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x556e17f40305]",
"(non-virtual thunk to PrimaryLogPG::queue_transaction(ceph::os::Transaction&&, boost::intrusive_ptr<OpRequest>)+0x53) [0x556e18079643]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x472) [0x556e182746e2]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x556e182765a8]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x556e180ac802]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x556e1804f9be]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x556e17ed9549]",
"(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556e18135fa8]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x556e17ef9508]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x556e18560934]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556e185635d4]",
"/lib64/libpthread.so.0(+0x814a) [0x7f9b103e214a]",
"clone()"
],
"ceph_version": "16.2.0-143.el8cp",
"crash_id": "2021-11-15T23:40:59.464430Z_232ea6fe-20fe-46a7-978c-78cd650a1274",
"entity_name": "osd.5",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.4 (Ootpa)",
"os_version_id": "8.4",
"process_name": "ceph-osd",
"stack_sig": "fb66b70c75e7efa0b1494766a0622afe6f862679538a4cad0f264ca51e71da42",
"timestamp": "2021-11-15T23:40:59.464430Z",
"utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2 ---------- Crash
{
"assert_condition": "is_primary()",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc",
"assert_func": "virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)",
"assert_line": 413,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f3f29ccf700 time 2021-11-16T02:24:18.099792+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/PrimaryLogPG.cc: 413: FAILED ceph_assert(is_primary())\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f3f4f5d0b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x558b11833d11]",
"ceph-osd(+0x568eda) [0x558b11833eda]",
"(PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x332) [0x558b11a2d5a2]",
"(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3bd) [0x558b11cd307d]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x24b) [0x558b11cd34bb]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x558b11cd55a8]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x558b11b0b802]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x558b11aae9be]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x558b11938549]",
"(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x558b11b94fa8]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x558b11958508]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x558b11fbf934]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x558b11fc25d4]",
"/lib64/libpthread.so.0(+0x814a) [0x7f3f4f5c614a]",
"clone()"
],
"ceph_version": "16.2.0-143.el8cp",
"crash_id": "2021-11-16T02:24:18.108029Z_ae6b9b02-a362-4ef9-a834-32503bc9a2a2",
"entity_name": "osd.5",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.4 (Ootpa)",
"os_version_id": "8.4",
"process_name": "ceph-osd",
"stack_sig": "73067413339673026a198490b3316de6c6afc1e0280a3d04181ed63198708efb",
"timestamp": "2021-11-16T02:24:18.108029Z",
"utsname_hostname": "rook-ceph-osd-5-7cf577f67c-fswjn",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872 ---------- Crash
{
"assert_condition": "p->second.need <= v || p->second.is_delete()",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h",
"assert_func": "void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]",
"assert_line": 4910,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: In function 'void pg_missing_set<TrackChanges>::got(const hobject_t&, eversion_t) [with bool TrackChanges = false]' thread 7fbdd920e700 time 2021-11-16T05:27:28.471737+0000\n/builddir/build/BUILD/ceph-16.2.0/src/osd/osd_types.h: 4910: FAILED ceph_assert(p->second.need <= v || p->second.is_delete())\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7fbdfe30eb20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x556252216d11]",
"ceph-osd(+0x568eda) [0x556252216eda]",
"(PeeringState::on_peer_recover(pg_shard_t, hobject_t const&, eversion_t const&)+0x1b4) [0x556252591264]",
"(ReplicatedBackend::handle_push_reply(pg_shard_t, PushReplyOp const&, PushOp*)+0x585) [0x5562526b0d35]",
"(ReplicatedBackend::do_push_reply(boost::intrusive_ptr<OpRequest>)+0x101) [0x5562526b39c1]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x237) [0x5562526b8537]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5562524ee802]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5562524919be]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55625231b549]",
"(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556252577fa8]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55625233b508]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5562529a2934]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5562529a55d4]",
"/lib64/libpthread.so.0(+0x814a) [0x7fbdfe30414a]",
"clone()"
],
"ceph_version": "16.2.0-143.el8cp",
"crash_id": "2021-11-16T05:27:28.550505Z_ece08349-efd3-4ee5-aaf1-2f95fbd42872",
"entity_name": "osd.2",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.4 (Ootpa)",
"os_version_id": "8.4",
"process_name": "ceph-osd",
"stack_sig": "85753395395b5d23971763812d6c1b4ed9a1fdbb2c56a90ae31a8ecf61472de5",
"timestamp": "2021-11-16T05:27:28.550505Z",
"utsname_hostname": "rook-ceph-osd-2-6d8d985544-kw4d2",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
Crash ------- 2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0 ---------- Crash
{
"assert_condition": "(sharded_in_flight_list.back())->ops_in_flight_sharded.empty()",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc",
"assert_func": "OpTracker::~OpTracker()",
"assert_line": 173,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: In function 'OpTracker::~OpTracker()' thread 7f84ab877700 time 2021-11-16T09:26:47.045591+0000\n/builddir/build/BUILD/ceph-16.2.0/src/common/TrackedOp.cc: 173: FAILED ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())\n",
"assert_thread_name": "ceph-mon",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f84a0508b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f84a2a0a5f1]",
"/usr/lib64/ceph/libceph-common.so.2(+0x2767ba) [0x7f84a2a0a7ba]",
"(OpTracker::~OpTracker()+0x39) [0x7f84a2affa99]",
"(Monitor::~Monitor()+0xac) [0x5612e47be6cc]",
"(Monitor::~Monitor()+0xd) [0x5612e47bf1ad]",
"main()",
"__libc_start_main()",
"_start()"
],
"ceph_version": "16.2.0-143.el8cp",
"crash_id": "2021-11-16T09:26:47.048294Z_920d9273-46d5-42db-9c4e-795aadec9ae0",
"entity_name": "mon.a",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.4 (Ootpa)",
"os_version_id": "8.4",
"process_name": "ceph-mon",
"stack_sig": "76e590fa2428cd05276f0da5e7a237b294c18401786d06e48f0176f55d44ba60",
"timestamp": "2021-11-16T09:26:47.048294Z",
"utsname_hostname": "rook-ceph-mon-a-547c944c66-ghmqh",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Sep 7 07:07:31 EDT 2021"
}
@mnelson, would you please confirm whether it is (or is not) a pglog problem?
Some OSDs crashed; as soon as I was able to exec into them ("oc rsh osd-pod") I got the output below from ceph daemon ...dump_mempools:
----
ceph daemon osd.3 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 5158,
"bytes": 412640
},
"bluestore_cache_data": {
"items": 1339,
"bytes": 62044609
},
"bluestore_cache_onode": {
"items": 2069,
"bytes": 1274504
},
"bluestore_cache_meta": {
"items": 176556,
"bytes": 1610916
},
"bluestore_cache_other": {
"items": 122512,
"bytes": 5575784
},
"bluestore_Buffer": {
"items": 1318,
"bytes": 126528
},
"bluestore_Extent": {
"items": 35830,
"bytes": 1719840
},
"bluestore_Blob": {
"items": 35774,
"bytes": 3720496
},
"bluestore_SharedBlob": {
"items": 25468,
"bytes": 2852416
},
"bluestore_inline_bl": {
"items": 957,
"bytes": 325454
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 0,
"bytes": 0
},
"bluestore_writing_deferred": {
"items": 0,
"bytes": 0
},
"bluestore_writing": {
"items": 68,
"bytes": 14086280
},
"bluefs": {
"items": 1341,
"bytes": 42704
},
"bluefs_file_reader": {
"items": 166,
"bytes": 596352
},
"bluefs_file_writer": {
"items": 3,
"bytes": 576
},
"buffer_anon": {
"items": 4707,
"bytes": 257878260
},
"buffer_meta": {
"items": 2897,
"bytes": 254936
},
"osd": {
"items": 90,
"bytes": 1018080
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 126766,
"bytes": 65270832
},
"osdmap": {
"items": 2034608,
"bytes": 32809328
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 2577627,
"bytes": 451620535
}
}
}
----
unset CEPH_ARGS
sh-4.4# ceph daemon osd.6 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 5836,
"bytes": 466880
},
"bluestore_cache_data": {
"items": 271,
"bytes": 4891083
},
"bluestore_cache_onode": {
"items": 422,
"bytes": 259952
},
"bluestore_cache_meta": {
"items": 24743,
"bytes": 178451
},
"bluestore_cache_other": {
"items": 9739,
"bytes": 430224
},
"bluestore_Buffer": {
"items": 133,
"bytes": 12768
},
"bluestore_Extent": {
"items": 2934,
"bytes": 140832
},
"bluestore_Blob": {
"items": 2934,
"bytes": 305136
},
"bluestore_SharedBlob": {
"items": 2475,
"bytes": 277200
},
"bluestore_inline_bl": {
"items": 192,
"bytes": 94434
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 1,
"bytes": 784
},
"bluestore_writing_deferred": {
"items": 4,
"bytes": 20096
},
"bluestore_writing": {
"items": 57,
"bytes": 21594347
},
"bluefs": {
"items": 757,
"bytes": 18360
},
"bluefs_file_reader": {
"items": 47,
"bytes": 593024
},
"bluefs_file_writer": {
"items": 3,
"bytes": 576
},
"buffer_anon": {
"items": 324,
"bytes": 10723644
},
"buffer_meta": {
"items": 212,
"bytes": 18656
},
"osd": {
"items": 98,
"bytes": 1063584
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 143921,
"bytes": 74269128
},
"osdmap": {
"items": 2041877,
"bytes": 32926004
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 2236980,
"bytes": 148285163
}
}
}
sh-4.4#
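If the mempool dumps had not been conclusive, the debug levels Mark suggested earlier (debug bluestore = 5, debug prioritycache = 5) could be raised from the toolbox without restarting the OSDs. A minimal sketch (the cluster-wide "osd" scope and the 5/5 levels are assumptions for illustration):
ceph config set osd debug_bluestore 5/5       # log BlueStore cache/allocation decisions in the OSD logs
ceph config set osd debug_prioritycache 5/5   # log the priority-cache autotuner's balancing decisions
# ... reproduce the data-removal workload and collect the OSD logs ...
ceph config rm osd debug_bluestore            # revert to defaults afterwards
ceph config rm osd debug_prioritycache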
Removing 4.9.z; not a TP blocker. This has existed for a long time and still needs investigation and QE input.
@ekuric OK, looking at those results it doesn't appear that the WAL buffers in rocksdb are backing up, imho. Josh Durgin mentioned that, given we are seeing this with RBD mirroring, it may be related to snapshot trimming. I went back and looked through an old thread on the users mailing list. Frank Schilder also saw memory growth seemingly due to snap trimming: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/ He gathered tcmalloc heap profiles and believes he may have observed a memory leak of decoded data: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TPIFMPQ6YHEK4GYH5LA6NWGRFXVW44MB/ It probably means we need to see whether we allocate memory to decode some data structure from rocksdb related to snap trimming somewhere and never actually free it.
Wrote that a little fast and can't edit; ignore the duplicate link, please. :)
Elvir, can you get more detailed memory allocation info as Mark suggests in comment#15?
*** Bug 2069753 has been marked as a duplicate of this bug. ***
Not a 4.11 blocker
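For reference, a minimal sketch of how the tcmalloc heap data Mark asks for could be gathered from one of the growing OSDs (assumptions: the default tcmalloc build, toolbox access, and osd.2 as an example ID):
ceph tell osd.2 heap stats            # how much memory tcmalloc holds vs. what the process is actually using
ceph tell osd.2 heap start_profiler   # begin writing heap profiles
# ... let the deletion / snap-trim workload run until memory grows ...
ceph tell osd.2 heap dump             # snapshot the current allocations
ceph tell osd.2 heap stop_profiler    # stop profiling
# The *.heap profile files are written alongside the OSD log and can be inspected with pprof.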