Bug 2149060

Summary: CephFS - MDS pods crash while creating many PVs on the CephFS storageClass
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: David Vaanunu <dvaanunu>
Component: ceph
Sub component: CephFS
Assignee: Venky Shankar <vshankar>
QA Contact: Elad <ebenahar>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
CC: bniver, gfarnum, hyelloji, muagarwa, ocs-bugs, odf-bz-bot, vshankar
Version: 4.11
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2023-03-28 04:37:11 UTC

Description David Vaanunu 2022-11-28 17:49:20 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
We are running backup & restore benchmarks of PVs using VolSync (restic).
While backing up/creating 100-200 PVs on CephFS, I got errors from the MDS.
At first the MDS crashed with OOM, so I increased both the CPU and memory spec for the MDS pods.
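Roughly, the bump was along these lines (the StorageCluster field path and the CPU/memory values below are illustrative assumptions, not the exact change that was applied):

# Sketch only: override the MDS resource requests/limits on the StorageCluster CR.
# The spec.resources.mds path and the values shown are assumptions, not the
# exact numbers used in this run.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge --patch \
  '{"spec": {"resources": {"mds": {
      "limits":   {"cpu": "3", "memory": "8Gi"},
      "requests": {"cpu": "3", "memory": "8Gi"}}}}}'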

After that, the MDS still crashed many times:
rook-ceph-crashcollector-worker010-r640-6bbf467977-q4w9t          1/1     Running   0                 8d
rook-ceph-crashcollector-worker011-r640-7878d85b86-kkkh2          1/1     Running   0                 7d23h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-76846b95hw957   2/2     Running   416 (6m10s ago)   2d6h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fdd8d5fcsx7g   2/2     Running   398 (2m20s ago)   2d6h
rook-ceph-mgr-a-69dd6b996-gq2r7                                   2/2     Running   0                 8d
rook-ceph-mon-a-7f85b88b95-c8dh6                                  2/2     Running   0                 8d
The logs show:
: 804867 cap_misses: 322 num_caps: 8][client_metric_type: READ_LATENCY latency: 0.000000][client_metric_type: WRITE_LATENCY latency: 1992-08-25T09:16:20.518000+0000][client_metric_type: METADATA_LATENCY latency: 1996-01-31T20:47:11.147000+0000][client_metric_type: DENTRY_LEASE dlease_hits: 160645 dlease_misses: 0 num_dentries: 5][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 8][client_metric_type: PINNED_ICAPS pinned_icaps: 8 total_inodes: 8][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 8][client_metric_type: READ_IO_SIZES total_ops: 0 total_size: 0][client_metric_type: WRITE_IO_SIZES total_ops: 80320 total_size: 328990720] v1 from client.1361432 v1:10.130.2.1:0/1120438752




Version of all relevant components (if applicable):
OCP ver 4.11.7 
ODF 4.11.3
BareMetal env with 3 masters & 12 workers

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)? Yes. We cannot create/delete PVs on the CephFS storage class.


Is there any workaround available to the best of your knowledge? No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 2


Is this issue reproducible? Yes


Can this issue be reproduced from the UI? No


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create many pods with data on the CephFS storage class (see the sketch below)
2. Monitor the storage pods and check whether the MDS crashed
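
A rough sketch of the reproduction loop (ocs-storagecluster-cephfs is the default ODF CephFS storage class; the PVC count, size, and namespace are arbitrary assumptions):

# Sketch: create many CephFS-backed PVCs, then watch the MDS pods for restarts.
# Count, size, and namespace are arbitrary assumptions.
for i in $(seq 1 200); do
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-test-${i}
  namespace: default
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ocs-storagecluster-cephfs
  resources:
    requests:
      storage: 1Gi
EOF
done

# Watch the MDS pods for crash/restart counts while workloads write to the PVCs
# (app=rook-ceph-mds is the usual Rook label for the MDS pods).
oc get pods -n openshift-storage -l app=rook-ceph-mds -w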


Actual results:
MDS pods crashed

Expected results:
Pod creation completes and the MDS does not crash

Additional info:
Pod log files & oc describe output:

https://drive.google.com/drive/folders/1qy9J88A2YfOe_qpt0a0WQ-8LdzIj762h?usp=sharing

Comment 2 Venky Shankar 2022-11-29 07:15:33 UTC
The crash backtrace is:

ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f5b3a243ce0]
 2: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::equal_range(dirfrag_t const&)+0x2b) [0x55dbb8e320ab]
 3: (std::_Rb_tree<dirfrag_t, dirfrag_t, std::_Identity<dirfrag_t>, std::less<dirfrag_t>, std::allocator<dirfrag_t> >::erase(dirfrag_t const&)+0x1a) [0x55dbb8e321fa]
 4: (MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x91) [0x55dbb8dd96f1]
 5: (EFragment::replay(MDSRank*)+0x430) [0x55dbb904d360]
 6: (MDLog::_replay_thread()+0xcd1) [0x55dbb8fd07f1]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55dbb8ccce11]
 8: /lib64/libpthread.so.0(+0x81cf) [0x7f5b3a2391cf]
 9: clone()

This is happening in

>uf.ls->uncommitted_fragments.erase(basedirfrag);

from:

void MDCache::finish_uncommitted_fragment(dirfrag_t basedirfrag, int op)
{
  dout(10) << "finish_uncommitted_fragments: base dirfrag " << basedirfrag
           << " op " << EFragment::op_name(op) << dendl;
  map<dirfrag_t, ufragment>::iterator it = uncommitted_fragments.find(basedirfrag);
  if (it != uncommitted_fragments.end()) {
    ufragment& uf = it->second;
    if (op != EFragment::OP_FINISH && !uf.old_frags.empty()) {
      uf.committed = true;
    } else {
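      // Backtrace frames 2-4 land here: erasing basedirfrag from the
      // LogSegment's std::set<dirfrag_t> (uf.ls->uncommitted_fragments).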
      uf.ls->uncommitted_fragments.erase(basedirfrag);
      mds->queue_waiters(uf.waiters);
      uncommitted_fragments.erase(it);
    }
  }
}


So this is a crash inside a standard library call: uf.ls->uncommitted_fragments is a std::set<dirfrag_t>, and the erase() on it is what faults. Can't tell right now what happened.

Comment 3 Greg Farnum 2022-11-30 07:04:42 UTC
Please turn up MDS debugging ("debug mds = 20", in particular) and boot the MDS again. That should give us more information about the specific EFragment that is causing issues.
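
On an ODF cluster this can be done from the Ceph toolbox, roughly along these lines (the toolbox deployment name and the MDS pod label below are assumptions based on the usual Rook/ODF defaults):

# Sketch: raise MDS debug logging, restart the MDS pods so the journal replay
# runs at the higher level, then list any new crash reports.
# deploy/rook-ceph-tools and app=rook-ceph-mds are assumed Rook/ODF defaults.
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph config set mds debug_mds 20
oc -n openshift-storage delete pod -l app=rook-ceph-mds
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph crash ls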

Comment 8 Greg Farnum 2022-12-01 15:10:21 UTC
These logs don't include any crashes, unlike the prior ones. Did the MDS go active in the interval? Or are they incomplete?

Comment 9 David Vaanunu 2022-12-02 10:26:05 UTC
MDS logs from 2022-12-02

https://drive.google.com/drive/folders/1W8pJ-5UugswxTtQDVrfay5mIX3JVvZpR?usp=sharing