Description of problem (please be as detailed as possible and provide log snippets):

We are running benchmarks of backup & restore of PVs using volsync (restic). While trying to back up/create PVs over CephFS (100-200), I got errors with the MDS. In the beginning, the MDS crashed on OOM, so I increased both the CPU & memory spec. After that, the MDS still crashed many times:

NAME                                                              READY   STATUS    RESTARTS          AGE
rook-ceph-crashcollector-worker010-r640-6bbf467977-q4w9t          1/1     Running   0                 8d
rook-ceph-crashcollector-worker011-r640-7878d85b86-kkkh2          1/1     Running   0                 7d23h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-76846b95hw957   2/2     Running   416 (6m10s ago)   2d6h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fdd8d5fcsx7g   2/2     Running   398 (2m20s ago)   2d6h
rook-ceph-mgr-a-69dd6b996-gq2r7                                   2/2     Running   0                 8d
rook-ceph-mon-a-7f85b88b95-c8dh6                                  2/2     Running   0                 8d

The logs show:

: 804867 cap_misses: 322 num_caps: 8][client_metric_type: READ_LATENCY latency: 0.000000][client_metric_type: WRITE_LATENCY latency: 1992-08-25T09:16:20.518000+0000][client_metric_type: METADATA_LATENCY latency: 1996-01-31T20:47:11.147000+0000][client_metric_type: DENTRY_LEASE dlease_hits: 160645 dlease_misses: 0 num_dentries: 5][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 8][client_metric_type: PINNED_ICAPS pinned_icaps: 8 total_inodes: 8][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 8][client_metric_type: READ_IO_SIZES total_ops: 0 total_size: 0][client_metric_type: WRITE_IO_SIZES total_ops: 80320 total_size: 328990720] v1 from client.1361432 v1:10.130.2.1:0/1120438752

Version of all relevant components (if applicable):
OCP 4.11.7
ODF 4.11.3
Bare-metal environment with 3 masters & 12 workers

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes. Can't create/delete PVs on the CephFS storage class.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create pods with data on the CephFS storage class.
2. Monitor the storage pods and check whether the MDS crashed (see the example commands after this report).

Actual results:
MDS pods crashed.

Expected results:
Pod creation completes and the MDS pods do not crash.

Additional info:
Pod log files & oc describe output: https://drive.google.com/drive/folders/1qy9J88A2YfOe_qpt0a0WQ-8LdzIj762h?usp=sharing
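As a rough illustration of the reproduction and monitoring steps above, a minimal sketch follows. It assumes the default ODF namespace (openshift-storage), the ocs-storagecluster-cephfs storage class, and the standard rook-ceph-mds pod label; the PVC/pod names (test-cephfs-pvc, cephfs-writer) are hypothetical and used only for illustration.

# Step 1: create a PVC on the CephFS storage class and a pod that writes data into it.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-writer
spec:
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sh", "-c", "dd if=/dev/urandom of=/data/testfile bs=1M count=1024 && sleep infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-cephfs-pvc
EOF

# Step 2: watch the MDS pods for restarts, and check the crashcollector pods.
oc -n openshift-storage get pods -l app=rook-ceph-mds -w
oc -n openshift-storage get pods | grep rook-ceph-crashcollector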
The crash backtrace is:

ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7f5b3a243ce0]
2: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::equal_range(dirfrag_t const&)+0x2b) [0x55dbb8e320ab]
3: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::erase(dirfrag_t const&)+0x1a) [0x55dbb8e321fa]
4: (MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x91) [0x55dbb8dd96f1]
5: (EFragment::replay(MDSRank*)+0x430) [0x55dbb904d360]
6: (MDLog::_replay_thread()+0xcd1) [0x55dbb8fd07f1]
7: (MDLog::ReplayThread::entry()+0x11) [0x55dbb8ccce11]
8: /lib64/libpthread.so.0(+0x81cf) [0x7f5b3a2391cf]
9: clone()

This is happening in

  uf.ls->uncommitted_fragments.erase(basedirfrag);

from:

  void MDCache::finish_uncommitted_fragment(dirfrag_t basedirfrag, int op)
  {
    dout(10) << "finish_uncommitted_fragments: base dirfrag " << basedirfrag
             << " op " << EFragment::op_name(op) << dendl;
    map<dirfrag_t, ufragment>::iterator it = uncommitted_fragments.find(basedirfrag);
    if (it != uncommitted_fragments.end()) {
      ufragment& uf = it->second;
      if (op != EFragment::OP_FINISH && !uf.old_frags.empty()) {
        uf.committed = true;
      } else {
        uf.ls->uncommitted_fragments.erase(basedirfrag);
        mds->queue_waiters(uf.waiters);
        uncommitted_fragments.erase(it);
      }
    }
  }

This is a crash inside the standard library call; uncommitted_fragments here is a std::set<>. Can't tell right now what happened.
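To pull the full crash report that the crashcollector pods should have archived, something along the following lines can be run from the ceph toolbox pod. This is a sketch assuming a default ODF deployment in the openshift-storage namespace with the rook-ceph-tools pod enabled.

# Open a shell in the ceph toolbox pod.
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
oc -n openshift-storage rsh "$TOOLS_POD"

# Inside the toolbox pod: list the archived crashes and dump the full
# backtrace/metadata for the MDS crash in question.
ceph crash ls
ceph crash info <crash-id>    # <crash-id> comes from the "ceph crash ls" output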
Please turn up MDS debugging ("debug mds = 20", in particular) and boot the MDS again. That should give us more information about the specific EFragment that is causing issues.
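For reference, a sketch of how the debug level could be raised in this Rook/ODF cluster; the namespace, ConfigMap name, and pod label are assumptions based on a default deployment.

# Option 1: raise the level at runtime from the ceph toolbox pod.
ceph config set mds debug_mds 20

# Option 2: persist it via Rook's config override and restart the MDS pods
# so they come up with the new level. Add the following under the "config"
# key of the ConfigMap:
#   [mds]
#   debug mds = 20
oc -n openshift-storage edit configmap rook-config-override
oc -n openshift-storage delete pod -l app=rook-ceph-mds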
These logs don't include any crashes, unlike the prior ones. Did the MDS go active in the interval? Or are they incomplete?
MDS logs from 2022-12-02 https://drive.google.com/drive/folders/1W8pJ-5UugswxTtQDVrfay5mIX3JVvZpR?usp=sharing