Bug 2149060
| Summary: | CephFS - MDS pods crash while creating many PVs on the CephFS StorageClass | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | David Vaanunu <dvaanunu> |
| Component: | ceph | Assignee: | Venky Shankar <vshankar> |
| ceph sub component: | CephFS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | bniver, gfarnum, hyelloji, muagarwa, ocs-bugs, odf-bz-bot, vshankar |
| Version: | 4.11 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-28 04:37:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Vaanunu
2022-11-28 17:49:20 UTC
The crash backtrace is:
ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7f5b3a243ce0]
2: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::equal_range(dirfrag_t const&)+0x2b) [0x55dbb8e320ab]
3: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::erase(dirfrag_t const&)+0x1a) [0x55dbb8e321fa]
4: (MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x91) [0x55dbb8dd96f1]
5: (EFragment::replay(MDSRank*)+0x430) [0x55dbb904d360]
6: (MDLog::_replay_thread()+0xcd1) [0x55dbb8fd07f1]
7: (MDLog::ReplayThread::entry()+0x11) [0x55dbb8ccce11]
8: /lib64/libpthread.so.0(+0x81cf) [0x7f5b3a2391cf]
9: clone()
This is happening in
>uf.ls->uncommitted_fragments.erase(basedirfrag);
from:
void MDCache::finish_uncommitted_fragment(dirfrag_t basedirfrag, int op)
{
  dout(10) << "finish_uncommitted_fragments: base dirfrag " << basedirfrag
           << " op " << EFragment::op_name(op) << dendl;
  map<dirfrag_t, ufragment>::iterator it = uncommitted_fragments.find(basedirfrag);
  if (it != uncommitted_fragments.end()) {
    ufragment& uf = it->second;
    if (op != EFragment::OP_FINISH && !uf.old_frags.empty()) {
      uf.committed = true;
    } else {
      uf.ls->uncommitted_fragments.erase(basedirfrag);
      mds->queue_waiters(uf.waiters);
      uncommitted_fragments.erase(it);
    }
  }
}
So this is a crash inside the standard library call - uncommitted_fragments is a std::set<>. Can't tell right now what happened.
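For illustration only: one failure mode that produces exactly this signature (a fault inside std::_Rb_tree::equal_range()/erase() rather than in MDCache code, frames 2-4 above) is reaching the set through a stale LogSegment pointer, i.e. uf.ls referring to freed memory. The standalone sketch below uses simplified stand-in types, not the real Ceph classes, and the actual root cause here is still undetermined:

// Illustrative only: simplified stand-ins for dirfrag_t / LogSegment /
// ufragment, showing how an erase() through a dangling `ls` pointer faults
// inside the _Rb_tree machinery instead of in MDCache code.
#include <cstdint>
#include <set>

struct dirfrag_like {
  uint64_t ino;
  uint32_t frag;
  bool operator<(const dirfrag_like &o) const {
    return ino != o.ino ? ino < o.ino : frag < o.frag;
  }
};

struct log_segment_like {
  std::set<dirfrag_like> uncommitted_fragments;  // a std::set<>, as in LogSegment
};

struct ufragment_like {
  log_segment_like *ls = nullptr;                // analogous to ufragment::ls
};

int main() {
  ufragment_like uf;
  dirfrag_like basedirfrag{1, 0};
  {
    log_segment_like seg;
    seg.uncommitted_fragments.insert(basedirfrag);
    uf.ls = &seg;
  }  // seg destroyed here; uf.ls now dangles

  // Undefined behaviour: erase() walks a freed red-black tree, so the crash
  // shows up in std::_Rb_tree::equal_range()/erase() (frames 2-4 above).
  uf.ls->uncommitted_fragments.erase(basedirfrag);
  return 0;
}

Whether uf.ls is actually invalid during this EFragment replay is exactly the kind of thing the higher debug level requested below should help confirm.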
Please turn up MDS debugging ("debug mds = 20", in particular) and boot the MDS again. That should give us more information about the specific EFragment that is causing issues.
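For reference, a minimal sketch of how that could be applied - the exact mechanism depends on how the cluster is managed (in an ODF cluster the rook-ceph-tools pod is the usual entry point), so adjust as needed:

# Raise MDS debug logging via the centralized config database:
ceph config set mds debug_mds 20

# Equivalent ceph.conf / config override under the [mds] section:
#   [mds]
#   debug mds = 20

# Then restart the MDS so the journal replay is captured at the higher level.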
These logs don't include any crashes, unlike the prior ones. Did the MDS go active in the interval? Or are they incomplete?

MDS logs from 2022-12-02: https://drive.google.com/drive/folders/1W8pJ-5UugswxTtQDVrfay5mIX3JVvZpR?usp=sharing