Bug 2276498 - [7.0 RGW-Multisite]: Crash observed in boost::asio module
Summary: [7.0 RGW-Multisite]: Crash observed in boost::asio module
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Matt Benjamin (redhat)
QA Contact: Tejas
URL:
Whiteboard:
Depends On: 2275284
Blocks:
 
Reported: 2024-04-22 17:16 UTC by Matt Benjamin (redhat)
Modified: 2024-06-13 14:32 UTC
CC List: 9 users

Fixed In Version: ceph-18.2.1-155.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of: 2275284
Environment:
Last Closed: 2024-06-13 14:32:07 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-8864 0 None None None 2024-04-22 17:17:05 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:32:10 UTC

Description Matt Benjamin (redhat) 2024-04-22 17:16:16 UTC
+++ This bug was initially created as a clone of Bug #2275284 +++

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Tejas on 2024-04-16 13:02:00 UTC ---

On ceph version 18.2.0-189, we observed a crash in boost::asio. We are seeing this in our automated suite during object upload on the secondary site of a multisite setup.
This is very intermittent and we have observed it only once.

2024-04-13 16:13:05,583 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'2024-04-13 06:43:05,480 INFO: cmd excuted'
2024-04-13 16:13:05,583 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'2024-04-13 06:43:05,480 INFO: {'
2024-04-13 16:13:05,584 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "backtrace": ['
2024-04-13 16:13:05,584 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",'
2024-04-13 16:13:05,585 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",'
2024-04-13 16:13:05,586 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",'
2024-04-13 16:13:05,586 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",'
2024-04-13 16:13:05,587 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",'
2024-04-13 16:13:05,588 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/b'
2024-04-13 16:13:05,588 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'in/radosgw(+0x379910) [0x55783b4ec910]",'
2024-04-13 16:13:05,589 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",'
2024-04-13 16:13:05,589 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",'
2024-04-13 16:13:05,590 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",'
2024-04-13 16:13:05,591 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",'
2024-04-13 16:13:05,591 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",'
2024-04-13 16:13:05,592 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'        "/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"'
2024-04-13 16:13:05,592 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    ],'
2024-04-13 16:13:05,593 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "ceph_version": '
2024-04-13 16:13:05,594 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'"18.2.0-189.el9cp",'
2024-04-13 16:13:05,594 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "crash_id": "2024-04-13T09:51:32.289835Z_ff6c6254-adcd-4890-b899-e4cc68668243",'
2024-04-13 16:13:05,595 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "entity_name": "client.rgw.shared.sec.ceph-sec-weekly-kjycov-jhx0gm-node7.tvuoct",'
2024-04-13 16:13:05,595 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "os_id": "rhel",'
2024-04-13 16:13:05,596 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "os_name": "Red Hat Enterprise Linux",'
2024-04-13 16:13:05,596 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "os_version": "9.3 (Plow)",'
2024-04-13 16:13:05,597 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "os_version_id": "9.3",'
2024-04-13 16:13:05,598 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "process_name": "radosgw",'
2024-04-13 16:13:05,598 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "stack_sig": "80017aa0b43c40d848b53eea81ab8d129357f64992a2007787d39491f906efbc",'
2024-04-13 16:13:05,599 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "timestamp": "2024-04-13T09:51:32.289835Z",'
2024-04-13 16:13:05,599 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "utsname_hostname": "ceph-sec-weekly-kjycov-jhx0gm-node7",'
2024-04-13 16:13:05,600 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "utsname_machine": "x86_64",'
2024-04-13 16:13:05,600 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "utsname_release": "5.14.0-362.24.1.el9_3.x86_64",'
2024-04-13 16:13:05,601 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "utsname_sysname": "Linux",'
2024-04-13 16:13:05,602 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024"'
2024-04-13 16:13:05,602 (cephci.sanity_rgw_multisite) [DEBUG] - cephci.RH.7.0.rhel-9.Weekly.18.2.0-189.rgw.37.cephci.ceph.ceph.py:1528 - b'}'

--- Additional comment from Tejas on 2024-04-16 13:03:44 UTC ---

Targeting this BZ to 7.0-z3

--- Additional comment from Matt Benjamin (redhat) on 2024-04-16 13:10:27 UTC ---

Tejas,

Why do you think it is appropriate to ignore a crash found in testing?

Matt

--- Additional comment from Tejas on 2024-04-16 13:13:44 UTC ---

(In reply to Matt Benjamin (redhat) from comment #3)
> Tejas,
> 
> Why do you think it is appropriate to ignore a crash found in testing?
> 
> Matt

Hi Matt,
We are trying to reproduce this issue, but we have only seen it once so far, and apart from the backtrace we do not have any other logs. So I created this BZ, but it is not blocking 7.0-z2.

--- Additional comment from Tejas on 2024-04-18 04:11:04 UTC ---

Quick update on this BZ:
We have seen this crash only in the original run, 1 out of 4 runs so far. We have tried running this on 18.2.0-189, 18.2.0-190 and 18.2.0-191. We will continue trying to reproduce it.

--- Additional comment from Mark Kogan on 2024-04-18 16:37:20 UTC ---

@Tejas, Hi,

is there information available regarding the workload when the crash occurred?
(is this a MS syncing RGW, a client-facing RGW, or both?)
(is there haproxy in the environment?)

are there RGW logs?

I seem to remember Casey noted that the crash occurred on shutdown:
> ... boost::asio::ssl::detail::shutdown_op ...

could this have happened during RGW shutdown?

and the `...asio::ssl...` hints that SSL was being used - could it be MS syncing using SSL, or was SSL used for client ops only?



Cleaning up the call stack above for readability:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2024-04-13 06:43:05,480 INFO: cmd excuted
2024-04-13 06:43:05,480 INFO: {
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",
        "/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",
        "/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",
        "(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",
        "/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",
        "/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",
        "(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",
        "/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",
        "/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",
        "/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",
        "/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"
    ],
    "ceph_version": "18.2.0-189.el9cp",
    "crash_id": "2024-04-13T09:51:32.289835Z_ff6c6254-adcd-4890-b899-e4cc68668243",
    "entity_name": "client.rgw.shared.sec.ceph-sec-weekly-kjycov-jhx0gm-node7.tvuoct",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "radosgw",
    "stack_sig": "80017aa0b43c40d848b53eea81ab8d129357f64992a2007787d39491f906efbc",
    "timestamp": "2024-04-13T09:51:32.289835Z",
    "utsname_hostname": "ceph-sec-weekly-kjycov-jhx0gm-node7",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-362.24.1.el9_3.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024"
}

--- Additional comment from Casey Bodley on 2024-04-18 17:06:43 UTC ---

thanks Mark,

> ... boost::asio::ssl::detail::shutdown_op ...

this shutdown_op is part of the stream.async_shutdown() call which you recently changed in https://github.com/ceph/ceph/pull/55967

--- Additional comment from Tejas on 2024-04-18 18:20:32 UTC ---

hi Mark,

this is the only log we have : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/openstack/RH/7.0/rhel-9/Weekly/18.2.0-189/rgw/37/tier-2_ssl_rgw_ms_ecpool_test/enabling_bucket_versioning_and_uploading_objects_on_secondary_0.log

As the test name states, we enable versioning and do some writes on the secondary site of the multisite setup. From the log, just before the crash we see MS waiting to sync; there are no shutdown operations done.
I think we have only 1 RGW daemon, which handles both IO and sync, and there is no haproxy configured.
Hope this helps.

--- Additional comment from Mark Kogan on 2024-04-22 09:20:37 UTC ---

completing the missing callstack symbols using addr2line:

"backtrace": ['
"/lib64/libc.so.6(+0x54db0) [0x7fd314053db0]",'

"/usr/bin/radosgw(+0x33b8ea) [0x55783b4ae8ea]",'
0x000000000033b8ea: boost::asio::detail::epoll_reactor::start_op(int, int, boost::asio::detail::epoll_reactor::descriptor_state*&, boost::asio::detail::reactor_op*, bool, bool) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/epoll_reactor.ipp:246:3

"/usr/bin/radosgw(+0x35ba27) [0x55783b4cea27]",'
0x000000000035ba27: boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >::operator()(boost::system::error_code, unsigned long, int) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/reactive_socket_service_base.hpp:419:13

"(boost::asio::detail::executor_op<boost::asio::detail::binder2<boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::system::error_code, unsigned long>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1d2) [0x55783b4ef882]",'

"/usr/bin/radosgw(+0x3807de) [0x55783b4f37de]",'
0x00000000003807de: boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>::operator()() at /usr/include/c++/11/bits/shared_ptr_base.h:1296:16

"/usr/bin/radosgw(+0x379910) [0x55783b4ec910]",'
0x0000000000379910: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul>::execute<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 4ul> const, void>&&) const at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:300:3

"(boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::ssl::detail::io_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, boost::asio::ssl::detail::shutdown_op, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >, void> >, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x6a6) [0x55783b4dda06]",'

"/usr/bin/radosgw(+0xb8534e) [0x55783bcf834e]",'
0x0000000000b8534e: boost::asio::detail::thread_info_base::rethrow_pending_exception() at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/thread_info_base.hpp:228:5
 (inlined by) boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:493:46
 (inlined by) boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] [clone .isra.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:210:20

"/usr/bin/radosgw(+0x3cf04d) [0x55783b54204d]",'
0x00000000003cf04d: std::thread::_State_impl<std::thread::_Invoker<std::tuple<(anonymous namespace)::AsioFrontend::run()::{lambda()#2}> > >::_M_run() [clone .lto_priv.0] at /usr/src/debug/ceph-18.2.0-189.el9cp.x86_64/redhat-linux-build/boost/include/boost/system/detail/error_code.hpp:305:13

"/lib64/libstdc++.so.6(+0xdb924) [0x7fd3143db924]",'
"/lib64/libc.so.6(+0x9f802) [0x7fd31409e802]",'
"/lib64/libc.so.6(+0x3f450) [0x7fd31403e450]"'

--- Additional comment from Mark Kogan on 2024-04-22 10:08:07 UTC ---

Hi Tejas, Casey,

@Casey
> this shutdown_op is part of the stream.async_shutdown()
a question regarding https://github.com/ceph/ceph/pull/55967:
before the change in the commit, stream.async_shutdown() was called in case there was an error condition;
after the change, stream.async_shutdown() is being called when there is no error condition.
could calling `stream.async_shutdown(yield[ec])` when there is no error be risky because of `yield[ec]`?
or maybe we should not call `stream.async_shutdown(yield[ec])` when there is an error, and we have not encountered it before because errors are rare.

@Tejas
asking, please, if possible, to try to reproduce with `tc` (see `man 8 tc`) by configuring packet loss on the RGW SSL endpoint
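
for example, a minimal netem-based sketch on the RGW node (the interface name and loss percentage below are placeholders, not values from this environment):

    # inject 5% packet loss on the interface serving the RGW SSL endpoint
    tc qdisc add dev eth0 root netem loss 5%

    # remove the qdisc again once the test run is done
    tc qdisc del dev eth0 root netem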

PS
> 1 RGW daemon which handles both IO and sync
a question to narrow this down - does the 1 RGW have both HTTP and HTTPS endpoints, and was the sync using HTTP,
or were both client and sync using HTTPS?

--- Additional comment from Tejas on 2024-04-22 10:29:23 UTC ---

(In reply to Mark Kogan from comment #10)

> @Tejas
> asking please if possible to try repro with `man 8 tc` by configuring packet
> loss on the RGW SSL endpoint

hi Mark,
  I saw the crash again on my 7th run, but we have no mechanism to enable coredumps in CI prior to run start, so now I'm trying to reproduce with coredumps enabled on the same VMs, but no luck so far. Sure, I can configure packet loss too.
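
For reference, a generic way to get coredumps on a RHEL 9 VM outside the CI tooling - this is a sketch of standard Linux settings, not a cephci feature - would be:

    # allow core files of unlimited size in the current shell / service
    ulimit -c unlimited

    # write cores to a predictable path instead of the systemd-coredump pipe
    sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t

For containerized RGW daemons the core limit may also need to be raised on the daemon's systemd unit, so treat the above only as a starting point.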

> 
> PS
> > 1 RGW daemon which handles both IO and sync
> question to narrow - the 1 RGW has both HTTP and HTTPS endpoints and was the
> sync using HTTP 
> or both client and sync using HTTPS?

HTTPS is being used for both client and sync.

--- Additional comment from Matt Benjamin (redhat) on 2024-04-22 11:55:44 UTC ---

thanks for the great debugging work, mark!

Matt

--- Additional comment from Casey Bodley on 2024-04-22 13:39:24 UTC ---

(In reply to Mark Kogan from comment #10)
> Hi Tejas, Casey,
> 
> @Cassey
> > this shutdown_op is part of the stream.async_shutdown()
> question reg https://github.com/ceph/ceph/pull/55967
> before the change in the commit, the stream.async_shutdown() was called in
> case there was an error condition,

i think you have this part backwards. the call had previously been wrapped in an `if (!ec) {` block, which means it was only made when there was no error

> after the change stream.async_shutdown() is being called when there is no
> error condition.
> could calling `stream.async_shutdown(yield[ec])` when there is no error be
> risky because of `yield[ec]`?
> or maybe we should not call `stream.async_shutdown(yield[ec])` when there is
> an error, and we have not encountered it before because errors are rare.

errors here are common because of http keepalive. the server keeps trying to read more requests from the client until the client hangs up, at which point the server sees errors like ECONNRESET

calling SSL_shutdown() on a closed socket should just return an error. it's very strange that we'd see crashes here unless the ssl::stream is being freed prematurely somehow

do we have the rgw debug logs leading up to the crash? can we see which asio error led up to this call? i'd look for errors logged by "write_data failed", "failed to read body", "failed to read header", "failed to discard unread message"
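
For example, something along these lines against the RGW debug log could surface those messages (the log path is an assumption and varies with the deployment):

    # look for the error paths listed above, leading up to the crash timestamp
    grep -E 'write_data failed|failed to read body|failed to read header|failed to discard unread message' \
        /var/log/ceph/*/ceph-client.rgw.*.log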

--- Additional comment from Matt Benjamin (redhat) on 2024-04-22 16:18:44 UTC ---

ok, returning this bz to 7.0z2 to state that we have reverted d96dcaf9fd7ef5530bddebb07f804049c840d87e


commit d96dcaf9fd7ef5530bddebb07f804049c840d87e (HEAD -> ceph-7.0-rhel-patches, rhgitlab/ceph-7.0-rhel-patches)
Author: matt benjamin <mbenjamin>
Date:   Mon Apr 22 10:21:38 2024 -0400

    Revert "rgw/beast: enablment of SSL session-id reuse speedup mechanism"
    
    This reverts commit f4025ea816ea0fcc3de0ddb6dd41b3b1fba76d29.
    
    Reverting to remove crash linked to this commit.
    
    Resolves: rhbz#2275284

Comment 8 errata-xmlrpc 2024-06-13 14:32:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

