Bug 2069720 - [DR] rbd_support: a schedule may get lost due to load vs add race
Summary: [DR] rbd_support: a schedule may get lost due to load vs add race
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 5.2
Assignee: Ilya Dryomov
QA Contact: Vasishta
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2067095 2102272
 
Reported: 2022-03-29 14:35 UTC by Scott Ostapovicz
Modified: 2022-08-09 17:38 UTC
CC List: 17 users

Fixed In Version: ceph-16.2.8-52.el8cp
Doc Type: Bug Fix
Doc Text:
.Snapshot-based mirroring process no longer gets cancelled
Previously, as a result of an internal race condition, the `rbd mirror snapshot schedule add` command could be cancelled out by a concurrent schedule load. The snapshot-based mirroring process for the affected image would not start if no other existing schedules were applicable. With this release, the race condition is fixed and the snapshot-based mirroring process starts as expected.
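
For context, a minimal Python sketch of the load-vs-add race named in the summary. The Schedules class, its field names, and the timings below are hypothetical stand-ins, not the actual rbd_support mgr module code; the sketch only illustrates how a periodic load that rebuilds the in-memory table can silently discard a schedule added while the load was in flight.

    # Hypothetical sketch -- not the actual rbd_support module code.
    import threading
    import time

    class Schedules:
        def __init__(self):
            self.lock = threading.Lock()
            self.schedules = {}              # image id -> snapshot interval

        def load(self):
            # Periodic refresh: rebuild the in-memory table from persisted
            # state. The read takes time, so an add that lands while the
            # read is in flight is missing from 'snapshot' and gets wiped
            # out by the assignment below.
            snapshot = dict(self.schedules)  # stands in for reading the store
            time.sleep(0.1)                  # simulate a slow read
            self.schedules = snapshot        # RACE: discards concurrent adds

        def add(self, image_id, interval):
            # What 'rbd mirror snapshot schedule add' boils down to here.
            self.schedules[image_id] = interval

    s = Schedules()
    loader = threading.Thread(target=s.load)
    loader.start()
    time.sleep(0.05)                         # the add races with load()
    s.add("image-1", "1h")
    loader.join()
    print(s.schedules)                       # {} -- the schedule was lost

The fix, in this sketch, is to hold self.lock across both load() and add() (e.g. a "with self.lock:" block around each body), so a refresh can no longer overwrite an add that raced with it.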
Clone Of: 2067095
Clones: 2099799
Environment:
Last Closed: 2022-08-09 17:37:39 UTC
Embargoed:




Links
Ceph Project Bug Tracker 56090 (last updated 2022-06-17 14:32:20 UTC)
Red Hat Issue Tracker RHCEPH-3887 (last updated 2022-03-29 15:57:21 UTC)
Red Hat Product Errata RHSA-2022:5997 (last updated 2022-08-09 17:38:08 UTC)

Comment 1 Josh Durgin 2022-03-29 15:24:14 UTC
Chris, can you take a look? It seems there are a number of rbd-mirror crashes with this backtrace:

    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f6ccc123700 time 2022-03-26T15:39:31.399999+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f6ce068ac20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6ce124ad4f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f18) [0x7f6ce124af18]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x5617d288a596]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x5617d288aae6]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x5617d2a0f25b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x5617d2a10014]",
        "(Context::complete(int)+0xd) [0x5617d26e080d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5617d26e0ca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x5617d2854435]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f6cea87e2e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f6cea883cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f6cdf499ba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f6ce068017a]",
        "clone()"
    ],

Comment 8 Scott Ostapovicz 2022-05-06 20:28:46 UTC
Done

Comment 15 Gopi 2022-07-01 04:33:29 UTC
Working as expected with the latest build, hence moving to the Verified state.

Comment 20 errata-xmlrpc 2022-08-09 17:37:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997

