Bug 2029778 - [cee/sd][rados] ceph-osd daemon crashed with Segmentation fault in thread 7f79a8cc0700 thread_name:msgr-worker-2
Summary: [cee/sd][rados] ceph-osd daemon crashed with Segmentation fault in thread 7f79a8cc0700 thread_name:msgr-worker-2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Radoslaw Zarzynski
QA Contact: skanta
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks: 2031073
Reported: 2021-12-07 10:00 UTC by Prasanth M V
Modified: 2024-03-26 10:26 UTC
CC: 20 users

Fixed In Version: ceph-16.2.7-4.el8cp
Doc Type: Bug Fix
Doc Text:
.No crash is expected when using the Ceph messenger instance during disconnection
A Ceph messenger instance maintains an internal registry of connections that can be accessed from multiple threads, so these accesses require proper synchronization. Previously, when a connection was being unregistered (for example, on disconnect), the registry was accessed without synchronization, which could crash the messenger's user, such as a Ceph OSD daemon. This release implements proper synchronization at the unregister stage, so these crashes are no longer expected.
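
For context, a minimal sketch of the race and of the fix. This is illustrative only, not the actual Ceph code: the real registry holds boost::intrusive_ptr<AsyncConnection> inside the async messenger (see the linked pull request for unregister_conn()); the Messenger/Connection names below are simplified stand-ins.

    #include <memory>
    #include <mutex>
    #include <set>

    // Simplified stand-ins for AsyncConnection and its intrusive_ptr.
    struct Connection {};
    using ConnectionRef = std::shared_ptr<Connection>;

    class Messenger {
      std::mutex lock;                // guards 'conns'
      std::set<ConnectionRef> conns;  // internal connection registry

    public:
      void register_conn(const ConnectionRef& c) {
        std::lock_guard<std::mutex> l(lock);
        conns.insert(c);
      }

      // Pre-fix behavior (illustrative): looking up the set without
      // holding the lock races with concurrent insert/erase from other
      // msgr-worker threads:
      //
      //   if (conns.find(c) != conns.end())  // unsynchronized access
      //     conns.erase(c);

      // Post-fix behavior: the whole lookup-and-erase runs under the lock.
      void unregister_conn(const ConnectionRef& c) {
        std::lock_guard<std::mutex> l(lock);
        conns.erase(c);
      }
    };

The key point is that std::set is a red-black tree: a find() walking the tree while another thread erases a node can dereference freed memory, which matches frame 2 (std::_Rb_tree<...>::find) in the backtrace below.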
Clone Of:
Environment:
Last Closed: 2022-04-04 10:23:34 UTC
Embargoed:




Links
System                    ID              Private  Priority  Status  Summary                                                       Last Updated
Ceph Project Bug Tracker  50483           0        None      None    None                                                          2021-12-07 16:39:43 UTC
GitHub ceph/ceph pull     43548           0        None      Merged  pacific: msgr/async: fix unsafe access in unregister_conn()   2021-12-07 16:41:22 UTC
Red Hat Issue Tracker     RHCEPH-2517     0        None      None    None                                                          2021-12-07 10:10:18 UTC
Red Hat Product Errata    RHSA-2022:1174  0        None      None    None                                                          2022-04-04 10:23:51 UTC

Description Prasanth M V 2021-12-07 10:00:44 UTC
Description of problem:

An OSD daemon crashed once on ceph version 16.2.0-117.el8cp (RHCS 5.0), in thread 7f79a8cc0700 (thread_name: msgr-worker-2).

The crash info:
   {
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]",
        "(AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]",
        "(ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]",
        "(ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]",
        "(ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]",
        "(ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]",
        "(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]",
        "(ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]",
        "(ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]",
        "(ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]",
        "(AsyncConnection::process()+0x789) [0x5557c7d69d19]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]",
        "/usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-117.el8cp",
    "crash_id": "2021-11-29T23:28:04.044560Z_c336b554-0ea2-4adc-88e7-a595182acb5e",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "f0a30000aaf2ae26cfc68aa3a57d6101fd483063ed20b285a94140185b036bff",
    "timestamp": "2021-11-29T23:28:04.044560Z",
    "utsname_hostname": "ceph-1",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-240.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Sep 23 05:13:10 EDT 2020"
}
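
A report in this format can be listed and fetched directly from the cluster's crash archive with the standard `ceph crash` commands; the crash_id below is the one from this report:

    # List recorded crashes, then dump the full report for this crash_id:
    ceph crash ls
    ceph crash info 2021-11-29T23:28:04.044560Z_c336b554-0ea2-4adc-88e7-a595182acb5e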


The same details appear in the journalctl output of the crashed OSD daemon:

Nov 30 00:28:04 ceph-1 conmon[375101]: *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]:  in thread 7f79a8cc0700 thread_name:msgr-worker-2
Nov 30 00:28:04 ceph-1 conmon[375101]:  ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
Nov 30 00:28:04 ceph-1 conmon[375101]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]
Nov 30 00:28:04 ceph-1 conmon[375101]:  2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  3: (AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  4: (ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]
Nov 30 00:28:04 ceph-1 conmon[375101]:  6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  7: (ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]
Nov 30 00:28:04 ceph-1 conmon[375101]:  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]
Nov 30 00:28:04 ceph-1 conmon[375101]:  10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]
Nov 30 00:28:04 ceph-1 conmon[375101]:  11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]
Nov 30 00:28:04 ceph-1 conmon[375101]:  12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  13: (AsyncConnection::process()+0x789) [0x5557c7d69d19]
Nov 30 00:28:04 ceph-1 conmon[375101]:  14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]
Nov 30 00:28:04 ceph-1 conmon[375101]:  15: /usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]
Nov 30 00:28:04 ceph-1 conmon[375101]:  17: /lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]
Nov 30 00:28:04 ceph-1 conmon[375101]:  18: clone()
Nov 30 00:28:04 ceph-1 conmon[375101]: debug 2021-11-29T23:28:04.045+0000 7f79a8cc0700 -1 *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]:  in thread 7f79a8cc0700 thread_name:msgr-worker-2
[... the same ceph version line and 18-frame backtrace are repeated verbatim in the debug log output ...]
Nov 30 00:28:04 ceph-1 conmon[375101]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
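
As the NOTE says, the raw offsets need the matching binary (plus debuginfo for file:line data) to be interpreted. A sketch, assuming the matching ceph-debuginfo package for 16.2.0-117.el8cp is installed and paths match this deployment:

    # Frame 15 is an offset into the ceph-osd binary itself (+0xe8f2bc):
    addr2line -Cfe /usr/bin/ceph-osd 0xe8f2bc
    # Or disassemble with interleaved source, as the NOTE suggests:
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump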


Further investigation shows that a similar issue has already been reported upstream [1].
[1] https://tracker.ceph.com/issues/49237


Version-Release number of selected component (if applicable):
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

Comment 15 errata-xmlrpc 2022-04-04 10:23:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

