Bug 2058524

Summary: MDS crash / rook-ceph-mds-ocs-storagecluster-cephfilesystem-a/b stuck in a CrashLoopBackOff
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: th3gov
Component: ceph
Assignee: Mudit Agarwal <muagarwa>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.9
CC: bniver, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-27 10:47:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
rook-ceph-mds-ocs-storagecluster-cephfilesystem-crashlog (Flags: none)

Description th3gov 2022-02-25 08:52:55 UTC
Created attachment 1863281 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-crashlog

Description of problem:
After upgrading from OpenShift Container Storage 4.8.8 to OpenShift Data Foundation 4.9.2, the mds container in the rook-ceph-mds-ocs-storagecluster-cephfilesystem-a/b pods does not start and is stuck in CrashLoopBackOff. I do not see any out-of-memory errors in the Events.

In the logs I found the following error:

debug     -1> 2022-02-24T14:37:50.432+0000 7f9bbe952700 -1 /builddir/build/BUILD/ceph-16.2.0/src/include/cephfs/metrics/Types.h: In function 'std::ostream& operator<<(std::ostream&, const ClientMetricType&)' thread 7f9bbe952700 time 2022-02-24T14:37:50.432534+0000
/builddir/build/BUILD/ceph-16.2.0/src/include/cephfs/metrics/Types.h: 56: ceph_abort_msg("abort() called")

 ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f9bc75c12a0]
 2: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f9bc78480ee]
 3: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f9bc78482c1]
 4: (DispatchQueue::entry()+0x1be2) [0x7f9bc77fdfa2]
 5: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f9bc78ad8a1]
 6: /lib64/libpthread.so.0(+0x817a) [0x7f9bc636117a]
 7: clone()

debug      0> 2022-02-24T14:37:50.434+0000 7f9bbe952700 -1 *** Caught signal (Aborted) **
 in thread 7f9bbe952700 thread_name:ms_dispatch

 ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f9bc636bc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f9bc75c1371]
 5: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f9bc78480ee]
 6: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f9bc78482c1]
 7: (DispatchQueue::entry()+0x1be2) [0x7f9bc77fdfa2]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f9bc78ad8a1]
 9: /lib64/libpthread.so.0(+0x817a) [0x7f9bc636117a]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
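
For context, the backtrace points at the operator<< for ClientMetricType in src/include/cephfs/metrics/Types.h aborting while the ms_dispatch thread prints an incoming MClientMetrics message. The following is a minimal, self-contained sketch of that failure pattern, not the actual Ceph source: the enum names and values, the abort_msg() helper, and the metric id 42 are illustrative assumptions. It only shows how a print operator that aborts on an unrecognized enum value, for example a metric type sent by a client newer than this MDS build, would take down the dispatch thread the way the trace above shows.

// Sketch only; compile with e.g. g++ -std=c++17 sketch.cc && ./a.out
#include <cstdlib>
#include <iostream>

// Illustrative stand-in for the real ClientMetricType enum; the names and
// numeric values here are assumptions, not copied from Ceph.
enum class ClientMetricType {
  CAP_INFO = 0,
  READ_LATENCY = 1,
  WRITE_LATENCY = 2,
  // A newer client can legitimately send ids this build has never heard of.
};

// Stand-in for the ceph_abort_msg("abort() called") seen in the log.
static void abort_msg(const char *msg) {
  std::cerr << msg << std::endl;
  std::abort();
}

std::ostream &operator<<(std::ostream &os, const ClientMetricType &type) {
  switch (type) {
  case ClientMetricType::CAP_INFO:
    os << "CAP_INFO";
    break;
  case ClientMetricType::READ_LATENCY:
    os << "READ_LATENCY";
    break;
  case ClientMetricType::WRITE_LATENCY:
    os << "WRITE_LATENCY";
    break;
  default:
    // An unknown metric type falls into an abort instead of being printed
    // as "unknown", so a single unexpected id kills the whole daemon.
    abort_msg("abort() called");
  }
  return os;
}

int main() {
  // Simulate a metric type id this build does not know about, as a newer
  // client could send over the wire.
  ClientMetricType unknown = static_cast<ClientMetricType>(42);
  std::cout << "printing metric type: " << unknown << std::endl;  // aborts here
  return 0;
}

If something like this is what is happening here, the abort would repeat every time such a client resends its metrics, which would be consistent with the mds pods staying in CrashLoopBackOff rather than failing only once.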


Version of all relevant components (if applicable):
OpenShift Data Foundation 4.9.2


Is there any workaround available to the best of your knowledge?
Maybe https://access.redhat.com/solutions/6617781, but I don't know if it's applicable to ODF.


Is this issue reproducible?
This issue may occur only in combination with Red Hat OpenShift Logging v5.3.4-13 and OpenShift Elasticsearch Operator v5.3.4-13, but I don't know for sure whether it's reproducible.

Comment 2 th3gov 2022-02-28 08:15:34 UTC
It seems I found a workaround:
After I disabled the "Console plugin" of ODF, the mds pods are no longer crashing.

Comment 3 Scott Ostapovicz 2022-03-14 14:53:06 UTC
Not sure which component this would be.