.No crash is expected when using the Ceph messenger instance during disconnection
A Ceph messenger instance maintains an internal registry of connections that can be accessed from multiple threads. These multi-threaded accesses require proper synchronization.
Previously, unregistering a connection, for example on disconnect, accessed the registry without synchronization, which could crash the messenger’s user, such as a Ceph OSD daemon.
This release implements proper synchronization at the unregister stage, so no crashes are expected.
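The synchronization requirement described above can be sketched as follows. This is a minimal illustration, not the actual Ceph fix: `Connection`, `ConnectionRegistry`, and all member names are hypothetical, and `std::shared_ptr` stands in for the `boost::intrusive_ptr<AsyncConnection>` elements seen in the backtrace in this report. The only point it makes is that register, unregister, and lookup all take the same lock before touching the shared set.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <set>

// Hypothetical stand-in for a messenger connection (not Ceph's AsyncConnection).
struct Connection {
    int id;
};

using ConnectionRef = std::shared_ptr<Connection>;

// Hypothetical registry: every access to the shared set holds the same mutex,
// so a lookup can never observe the tree while another thread is modifying it.
class ConnectionRegistry {
public:
    void register_connection(const ConnectionRef& conn) {
        std::lock_guard<std::mutex> guard(lock_);
        conns_.insert(conn);
    }

    // Returns true if the connection was present and has been removed.
    bool unregister_connection(const ConnectionRef& conn) {
        std::lock_guard<std::mutex> guard(lock_);
        return conns_.erase(conn) > 0;
    }

    bool contains(const ConnectionRef& conn) const {
        std::lock_guard<std::mutex> guard(lock_);
        return conns_.find(conn) != conns_.end();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return conns_.size();
    }

private:
    mutable std::mutex lock_;        // guards conns_
    std::set<ConnectionRef> conns_;  // ordered set, as in the backtrace's std::_Rb_tree
};
```

Without the lock in `unregister_connection`, a concurrent `find` on the same `std::set` is a data race, which matches the kind of segmentation fault inside `std::_Rb_tree::find` reported here.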
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:1174
Description of problem:
The OSD daemon crashed once on ceph version 16.2.0-117.el8cp (RHCS 5.0) in thread 7f79a8cc0700 thread_name:msgr-worker-2.

The crash info:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]",
        "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]",
        "(AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]",
        "(ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]",
        "(ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]",
        "(ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]",
        "(ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]",
        "(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]",
        "(ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]",
        "(ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]",
        "(ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]",
        "(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]",
        "(AsyncConnection::process()+0x789) [0x5557c7d69d19]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]",
        "/usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]",
        "/lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-117.el8cp",
    "crash_id": "2021-11-29T23:28:04.044560Z_c336b554-0ea2-4adc-88e7-a595182acb5e",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.4 (Ootpa)",
    "os_version_id": "8.4",
    "process_name": "ceph-osd",
    "stack_sig": "f0a30000aaf2ae26cfc68aa3a57d6101fd483063ed20b285a94140185b036bff",
    "timestamp": "2021-11-29T23:28:04.044560Z",
    "utsname_hostname": "ceph-1",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-240.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Sep 23 05:13:10 EDT 2020"
}

The same details appear in the journalctl output of the OSD daemon that crashed:

Nov 30 00:28:04 ceph-1 conmon[375101]: *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]: in thread 7f79a8cc0700 thread_name:msgr-worker-2
Nov 30 00:28:04 ceph-1 conmon[375101]: ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
Nov 30 00:28:04 ceph-1 conmon[375101]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]
Nov 30 00:28:04 ceph-1 conmon[375101]: 2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 3: (AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]
Nov 30 00:28:04 ceph-1 conmon[375101]: 4: (ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]
Nov 30 00:28:04 ceph-1 conmon[375101]: 5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]
Nov 30 00:28:04 ceph-1 conmon[375101]: 6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]
Nov 30 00:28:04 ceph-1 conmon[375101]: 7: (ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]
Nov 30 00:28:04 ceph-1 conmon[375101]: 8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]
Nov 30 00:28:04 ceph-1 conmon[375101]: 9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]
Nov 30 00:28:04 ceph-1 conmon[375101]: 10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]
Nov 30 00:28:04 ceph-1 conmon[375101]: 11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]
Nov 30 00:28:04 ceph-1 conmon[375101]: 12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 13: (AsyncConnection::process()+0x789) [0x5557c7d69d19]
Nov 30 00:28:04 ceph-1 conmon[375101]: 14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]
Nov 30 00:28:04 ceph-1 conmon[375101]: 15: /usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]
Nov 30 00:28:04 ceph-1 conmon[375101]: 17: /lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]
Nov 30 00:28:04 ceph-1 conmon[375101]: 18: clone()
Nov 30 00:28:04 ceph-1 conmon[375101]: debug 2021-11-29T23:28:04.045+0000 7f79a8cc0700 -1 *** Caught signal (Segmentation fault) **
Nov 30 00:28:04 ceph-1 conmon[375101]: in thread 7f79a8cc0700 thread_name:msgr-worker-2
Nov 30 00:28:04 ceph-1 conmon[375101]:
Nov 30 00:28:04 ceph-1 conmon[375101]: ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
Nov 30 00:28:04 ceph-1 conmon[375101]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f79adcbbb20]
Nov 30 00:28:04 ceph-1 conmon[375101]: 2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5557c7d6cacc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 3: (AsyncConnection::_stop()+0xab) [0x5557c7d66c7b]
Nov 30 00:28:04 ceph-1 conmon[375101]: 4: (ProtocolV2::stop()+0x8f) [0x5557c7d91d5f]
Nov 30 00:28:04 ceph-1 conmon[375101]: 5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x5557c7da74a2]
Nov 30 00:28:04 ceph-1 conmon[375101]: 6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x5557c7da8d3f]
Nov 30 00:28:04 ceph-1 conmon[375101]: 7: (ProtocolV2::handle_frame_payload()+0x20b) [0x5557c7da934b]
Nov 30 00:28:04 ceph-1 conmon[375101]: 8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x5557c7da95d0]
Nov 30 00:28:04 ceph-1 conmon[375101]: 9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x5557c7da97c5]
Nov 30 00:28:04 ceph-1 conmon[375101]: 10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x5557c7da9872]
Nov 30 00:28:04 ceph-1 conmon[375101]: 11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x5557c7daa9c1]
Nov 30 00:28:04 ceph-1 conmon[375101]: 12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x5557c7d92bfc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 13: (AsyncConnection::process()+0x789) [0x5557c7d69d19]
Nov 30 00:28:04 ceph-1 conmon[375101]: 14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5557c7bb8797]
Nov 30 00:28:04 ceph-1 conmon[375101]: 15: /usr/bin/ceph-osd(+0xe8f2bc) [0x5557c7bbc2bc]
Nov 30 00:28:04 ceph-1 conmon[375101]: 16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f79ad306ba3]
Nov 30 00:28:04 ceph-1 conmon[375101]: 17: /lib64/libpthread.so.0(+0x814a) [0x7f79adcb114a]
Nov 30 00:28:04 ceph-1 conmon[375101]: 18: clone()
Nov 30 00:28:04 ceph-1 conmon[375101]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

On further investigation, I found that a similar issue has been raised upstream [1].

[1] https://tracker.ceph.com/issues/49237

Version-Release number of selected component (if applicable):
ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)