Bug 2029778

Summary: [cee/sd][rados] ceph-osd daemon crashed with Segmentation fault in thread 7f79a8cc0700 thread_name:msgr-worker-2
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Prasanth M V <pmv>
Component: RADOS Assignee: Radoslaw Zarzynski <rzarzyns>
Status: CLOSED ERRATA QA Contact: skanta
Severity: medium Docs Contact: Ranjini M N <rmandyam>
Priority: unspecified    
Version: 5.0 CC: agunn, akupczyk, amathuri, bhubbard, ceph-eng-bugs, gjose, ksirivad, lflores, lithomas, mmuench, nojha, pdhange, rfriedma, rmandyam, rzarzyns, skanta, sseshasa, tserlin, vereddy, vumrao
Target Milestone: --- Keywords: CodeChange, Rebase
Target Release: 5.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-16.2.7-4.el8cp Doc Type: Bug Fix
Doc Text:
.No crash is expected when using the Ceph messenger instance during disconnection
A Ceph messenger instance maintains an internal registry of connections that can be accessed from multiple threads, and these accesses require proper synchronization. Previously, when a connection was being unregistered, for example on disconnection, the registry could be accessed without synchronization, which could crash the messenger’s user, such as a Ceph OSD daemon. This release implements proper synchronization at the unregister stage, so no crashes are expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-04 10:23:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 2031073    
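
The Doc Text above describes a connection registry that is shared between threads and whose unregister path lacked synchronization. As a rough illustration only, and not the actual Ceph fix, the following C++ sketch shows the general pattern of guarding every registry access, including unregistration, with a single mutex. All names here (Connection, ConnectionRegistry, register_conn, unregister_conn, churn) are hypothetical and do not correspond to Ceph's messenger code.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for a connection object (not Ceph's AsyncConnection).
struct Connection {
  int id;
  explicit Connection(int i) : id(i) {}
};
using ConnectionRef = std::shared_ptr<Connection>;

// Hypothetical registry: every access to the underlying std::set,
// including lookups and removals, happens while holding the same mutex.
class ConnectionRegistry {
 public:
  void register_conn(const ConnectionRef& c) {
    std::lock_guard<std::mutex> guard(lock_);
    conns_.insert(c);
  }

  // Returns true if the connection was registered and has now been removed.
  // The find() and erase() run under the lock, so no other thread can
  // modify the tree while it is being searched.
  bool unregister_conn(const ConnectionRef& c) {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = conns_.find(c);
    if (it == conns_.end()) {
      return false;
    }
    conns_.erase(it);
    return true;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(lock_);
    return conns_.size();
  }

 private:
  mutable std::mutex lock_;
  std::set<ConnectionRef> conns_;
};

// Registers and immediately unregisters connections from several threads
// at once, then returns how many connections remain in the registry.
std::size_t churn(ConnectionRegistry& reg, int n_threads, int per_thread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t) {
    workers.emplace_back([&reg, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        auto c = std::make_shared<Connection>(t * per_thread + i);
        reg.register_conn(c);
        reg.unregister_conn(c);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return reg.size();
}
```

Without the lock in unregister_conn, one thread could search the tree while another rebalances it, which is the general kind of race that can surface as a segmentation fault inside a std::set lookup like the one in the backtrace below. Whether the real fix uses a mutex of this shape is an assumption of this sketch.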

Description Prasanth M V 2021-12-07 10:00:44 UTC
Description of problem:

The OSD daemon crashed once on ceph version 16.2.0-117.el8cp (RHCS 5.0) in thread 7f79a8cc0700 thread_name:msgr-worker-2.

The crash info:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]",
        "(AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]",
        "(ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]",
        "(ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]",
        "(ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]",
        "(ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]",
        "(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]",
        "(ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]",
        "(ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]",
        "(ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]",
        "(AsyncConnection::process()+0x789) [0x5557c7d69d19]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]",
        "/usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-117.el8cp",
    "crash_id": "2021-11-29T23:28:04.044560Z_c336b554-0ea2-4adc-88e7-a595182acb5e",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "f0a30000aaf2ae26cfc68aa3a57d6101fd483063ed20b285a94140185b036bff",
    "timestamp": "2021-11-29T23:28:04.044560Z",
    "utsname_hostname": "ceph-1",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-240.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Sep 23 05:13:10 EDT 2020"
}


The same details appear in the journalctl output for the crashed OSD daemon:

Nov 30 00:28:04 ceph-1 conmon[375101]: *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]:  in thread 7f79a8cc0700 thread_name:msgr-worker-2
Nov 30 00:28:04 ceph-1 conmon[375101]:  ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
Nov 30 00:28:04 ceph-1 conmon[375101]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]
Nov 30 00:28:04 ceph-1 conmon[375101]:  2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  3: (AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  4: (ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]
Nov 30 00:28:04 ceph-1 conmon[375101]:  6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  7: (ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]
Nov 30 00:28:04 ceph-1 conmon[375101]:  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]
Nov 30 00:28:04 ceph-1 conmon[375101]:  10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]
Nov 30 00:28:04 ceph-1 conmon[375101]:  11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]
Nov 30 00:28:04 ceph-1 conmon[375101]:  12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  13: (AsyncConnection::process()+0x789) [0x5557c7d69d19]
Nov 30 00:28:04 ceph-1 conmon[375101]:  14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]
Nov 30 00:28:04 ceph-1 conmon[375101]:  15: /usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]
Nov 30 00:28:04 ceph-1 conmon[375101]:  17: /lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]
Nov 30 00:28:04 ceph-1 conmon[375101]:  18: clone()
Nov 30 00:28:04 ceph-1 conmon[375101]: debug 2021-11-29T23:28:04.045+0000 7f79a8cc0700 -1 *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]:  in thread 7f79a8cc0700 thread_name:msgr-worker-2
Nov 30 00:28:04 ceph-1 conmon[375101]:
Nov 30 00:28:04 ceph-1 conmon[375101]:  ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
Nov 30 00:28:04 ceph-1 conmon[375101]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]
Nov 30 00:28:04 ceph-1 conmon[375101]:  2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  3: (AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  4: (ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]
Nov 30 00:28:04 ceph-1 conmon[375101]:  6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]
Nov 30 00:28:04 ceph-1 conmon[375101]:  7: (ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]
Nov 30 00:28:04 ceph-1 conmon[375101]:  8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]
Nov 30 00:28:04 ceph-1 conmon[375101]:  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]
Nov 30 00:28:04 ceph-1 conmon[375101]:  10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]
Nov 30 00:28:04 ceph-1 conmon[375101]:  11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]
Nov 30 00:28:04 ceph-1 conmon[375101]:  12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  13: (AsyncConnection::process()+0x789) [0x5557c7d69d19]
Nov 30 00:28:04 ceph-1 conmon[375101]:  14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]
Nov 30 00:28:04 ceph-1 conmon[375101]:  15: /usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]
Nov 30 00:28:04 ceph-1 conmon[375101]:  16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]
Nov 30 00:28:04 ceph-1 conmon[375101]:  17: /lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]
Nov 30 00:28:04 ceph-1 conmon[375101]:  18: clone()
Nov 30 00:28:04 ceph-1 conmon[375101]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


On further investigation, I found that a similar issue has been reported upstream [1].
[1] https://tracker.ceph.com/issues/49237


Version-Release number of selected component (if applicable):
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

Comment 15 errata-xmlrpc 2022-04-04 10:23:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174