Bug 2069720

Summary: [DR] rbd_support: a schedule may get lost due to load vs add race
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Scott Ostapovicz <sostapov>
Component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact: Akash Raj <akraj>
Priority: unspecified
Version: 5.1
CC: akraj, asriram, bniver, ceph-eng-bugs, choffman, idryomov, jdurgin, kramdoss, kseeger, madam, mmuench, muagarwa, ocs-bugs, prsurve, srangana, tserlin, vereddy
Target Milestone: ---
Target Release: 5.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.8-52.el8cp
Doc Type: Bug Fix
Doc Text:
.Snapshot-based mirroring process no longer gets cancelled
Previously, as a result of an internal race condition, the `rbd mirror snapshot schedule add` command would be cancelled out, and the snapshot-based mirroring process for the affected image would not start if no other existing schedule applied. With this release, the race condition is fixed and the snapshot-based mirroring process starts as expected.
Story Points: ---
Clone Of: 2067095
Cloned To: 2099799
Last Closed: 2022-08-09 17:37:39 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
Bug Blocks: 2067095, 2102272    
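
To make the Doc Text above concrete, here is a minimal sketch of the kind of load-vs-add race it describes. All names here are hypothetical illustrations: the real logic lives in the Python rbd_support ceph-mgr module, and this models only the failure pattern, not the actual implementation.

    import threading

    class ScheduleStore:
        """Illustrative lost-update race -- not the rbd_support code."""

        def __init__(self):
            self._lock = threading.Lock()
            self._schedules = {}  # level spec -> snapshot interval

        def add(self, level_spec, interval):
            with self._lock:
                self._schedules[level_spec] = interval

        def load(self, persisted):
            # Racy pattern: 'persisted' was read from storage before a
            # concurrent add() committed, so replacing the map wholesale
            # silently drops the just-added schedule -- the "rbd mirror
            # snapshot schedule add" is effectively cancelled out.
            with self._lock:
                self._schedules = dict(persisted)

Broadly speaking, eliminating such a race means serializing the reload against concurrent add/remove operations (or re-applying pending changes once the load completes), which is consistent with the behavior the Doc Text describes for the fixed release.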

Comment 1 Josh Durgin 2022-03-29 15:24:14 UTC
Chris, can you take a look? It seems there are a number of rbd-mirror crashes with this backtrace:

    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f6ccc123700 time 2022-03-26T15:39:31.399999+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f6ce068ac20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6ce124ad4f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f18) [0x7f6ce124af18]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x5617d288a596]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x5617d288aae6]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x5617d2a0f25b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x5617d2a10014]",
        "(Context::complete(int)+0xd) [0x5617d26e080d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5617d26e0ca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x5617d2854435]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f6cea87e2e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f6cea883cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f6cdf499ba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f6ce068017a]",
        "clone()"
    ],
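
For readers decoding the trace: the failed assertion in librbd::ImageWatcher<>::schedule_request_lock() requires that the image still has an exclusive-lock object and that this client is not already the lock owner when a lock request is queued. A rough Python rendering of that invariant, with hypothetical names (the authoritative code is the C++ above):

    def schedule_request_lock(image_ctx, use_timer, timer_delay=0):
        # Mirrors the intent of the C++ ceph_assert: both conditions must
        # hold before a lock request is scheduled. The crash means one was
        # violated, e.g. the watcher requested a lock it already held.
        assert image_ctx.exclusive_lock is not None, "exclusive lock gone"
        assert not image_ctx.exclusive_lock.is_lock_owner(), "already owner"
        ...  # queue the (possibly delayed) lock request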

Comment 8 Scott Ostapovicz 2022-05-06 20:28:46 UTC
Done

Comment 15 Gopi 2022-07-01 04:33:29 UTC
Working as expected with the latest build, hence moving to the verified state.

Comment 20 errata-xmlrpc 2022-08-09 17:37:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997