Bug 2058524 - MDS crash / rook-ceph-mds-ocs-storagecluster-cephfilesystem-a/b stuck in a CrashLoopBackOff
Status: CLOSED NOTABUG
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Assignee: Mudit Agarwal
QA Contact: Elad
 
Reported: 2022-02-25 08:52 UTC by th3gov
Modified: 2023-08-09 16:37 UTC

Last Closed: 2022-05-27 10:47:56 UTC


Attachments
rook-ceph-mds-ocs-storagecluster-cephfilesystem-crashlog (106.77 KB, text/plain)
2022-02-25 08:52 UTC, th3gov

Description th3gov 2022-02-25 08:52:55 UTC
Created attachment 1863281 [details]
rook-ceph-mds-ocs-storagecluster-cephfilesystem-crashlog

Description of problem:
After upgrading from OpenShift Container Storage 4.8.8 to OpenShift Data Foundation 4.9.2, the mds container in the rook-ceph-mds-ocs-storagecluster-cephfilesystem-a/b pods does not start and is stuck in CrashLoopBackOff. I do not see any out-of-memory errors in the pod Events.
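
The pod status and Events can be checked with something like the following (openshift-storage is the default ODF namespace and app=rook-ceph-mds the usual Rook label; both may differ on other setups):

  # list the MDS pods and confirm they are in CrashLoopBackOff
  oc -n openshift-storage get pods -l app=rook-ceph-mds
  # check the Events of one of the crashing pods (use the actual pod name)
  oc -n openshift-storage describe pod <mds-pod-name>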

In the logs I found the following error:

debug     -1> 2022-02-24T14:37:50.432+0000 7f9bbe952700 -1 /builddir/build/BUILD/ceph-16.2.0/src/include/cephfs/metrics/Types.h: In function 'std::ostream& operator<<(std::ostream&, const ClientMetricType&)' thread 7f9bbe952700 time 2022-02-24T14:37:50.432534+0000
/builddir/build/BUILD/ceph-16.2.0/src/include/cephfs/metrics/Types.h: 56: ceph_abort_msg("abort() called")

 ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f9bc75c12a0]
 2: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f9bc78480ee]
 3: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f9bc78482c1]
 4: (DispatchQueue::entry()+0x1be2) [0x7f9bc77fdfa2]
 5: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f9bc78ad8a1]
 6: /lib64/libpthread.so.0(+0x817a) [0x7f9bc636117a]
 7: clone()

debug      0> 2022-02-24T14:37:50.434+0000 7f9bbe952700 -1 *** Caught signal (Aborted) **
 in thread 7f9bbe952700 thread_name:ms_dispatch

 ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f9bc636bc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f9bc75c1371]
 5: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f9bc78480ee]
 6: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f9bc78482c1]
 7: (DispatchQueue::entry()+0x1be2) [0x7f9bc77fdfa2]
 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f9bc78ad8a1]
 9: /lib64/libpthread.so.0(+0x817a) [0x7f9bc636117a]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
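
To pull the same information from a cluster, something like the following should work (the container name mds is the Rook default; the rook-ceph toolbox pod has to be deployed separately and its name here is a placeholder):

  # previous (crashed) container log of an MDS pod
  oc -n openshift-storage logs <mds-pod-name> -c mds --previous
  # from the rook-ceph toolbox, list the recorded crashes and show the details
  oc -n openshift-storage rsh <rook-ceph-tools-pod>
  ceph crash ls
  ceph crash info <crash-id>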


Version of all relevant components (if applicable):
OpenShift Data Foundation 4.9.2


Is there any workaround available to the best of your knowledge?
Maybe https://access.redhat.com/solutions/6617781, but I don't know if it's applicable to ODF.


Is this issue reproducible?
Maybe this issue only occurs in combination with Red Hat OpenShift Logging v5.3.4-13 and OpenShift Elasticsearch Operator v5.3.4-13, but I don't know for sure whether it's reproducible.
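
The installed operator versions can be confirmed with something like this (output columns depend on the oc version):

  # list the ClusterServiceVersions of the operators involved
  oc get csv -A | grep -Ei 'odf|ocs|logging|elasticsearch'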

Comment 2 th3gov 2022-02-28 08:15:34 UTC
It seems I found a workaround:
After I disabled the "Console plugin" for ODF, the mds pods are no longer crashing.
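
For anyone who wants to try the same workaround from the CLI: the enabled console plugins are listed in the console operator resource. The sketch below assumes the ODF plugin is named odf-console, which matches a default ODF 4.9 install but should be verified against the output of the first command.

  # show which console plugins are currently enabled
  oc get console.operator.openshift.io cluster -o jsonpath='{.spec.plugins}'
  # remove the odf-console entry from that list (adjust the index to match the output above)
  oc patch console.operator.openshift.io cluster --type=json \
    -p '[{"op": "remove", "path": "/spec/plugins/0"}]'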

Comment 3 Scott Ostapovicz 2022-03-14 14:53:06 UTC
Not sure which component this would be.

