Bug 2069720

Summary: [DR] rbd_support: a schedule may get lost due to load vs add race
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Scott Ostapovicz <sostapov>
Component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact: Akash Raj <akraj>
Priority: unspecified
Version: 5.1
CC: akraj, asriram, bniver, ceph-eng-bugs, choffman, idryomov, jdurgin, kramdoss, kseeger, madam, mmuench, muagarwa, ocs-bugs, prsurve, srangana, tserlin, vereddy
Target Milestone: ---
Target Release: 5.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.8-52.el8cp
Doc Type: Bug Fix
Doc Text:
.Snapshot-based mirroring process no longer gets cancelled
Previously, as a result of an internal race condition, the `rbd mirror snapshot schedule add` command would be cancelled out, and the snapshot-based mirroring process for the affected image would not start if no other existing schedule applied. With this release, the race condition is fixed and the snapshot-based mirroring process starts as expected.
Story Points: ---
Clone Of: 2067095
Cloned To: 2099799
Last Closed: 2022-08-09 17:37:39 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
Bug Blocks: 2067095, 2102272    
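
To make the Doc Text above concrete, here is a minimal sketch of the kind of load-vs-add race it describes. All names here are hypothetical illustrations: the real logic lives in the Python rbd_support ceph-mgr module, and this models only the failure pattern, not the actual implementation.

    import threading

    class ScheduleStore:
        """Illustrative lost-update race -- not the rbd_support code."""

        def __init__(self):
            self._lock = threading.Lock()
            self._schedules = {}  # level spec -> snapshot interval

        def add(self, level_spec, interval):
            with self._lock:
                self._schedules[level_spec] = interval

        def load(self, persisted):
            # Racy pattern: 'persisted' was read from storage before a
            # concurrent add() committed, so replacing the map wholesale
            # silently drops the just-added schedule -- the "rbd mirror
            # snapshot schedule add" is effectively cancelled out.
            with self._lock:
                self._schedules = dict(persisted)

Broadly speaking, eliminating such a race means serializing the reload against concurrent add/remove operations (or re-applying pending changes once the load completes), which is consistent with the behavior the Doc Text describes for the fixed release.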

Comment 1 Josh Durgin 2022-03-29 15:24:14 UTC
Chris, can you take a look? It seems there are a number of rbd-mirror crashes with this backtrace:

    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f6ccc123700 time 2022-03-26T15:39:31.399999+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f6ce068ac20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6ce124ad4f]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f18) [0x7f6ce124af18]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x5617d288a596]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x5617d288aae6]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x5617d2a0f25b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x5617d2a10014]",
        "(Context::complete(int)+0xd) [0x5617d26e080d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5617d26e0ca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x5617d2854435]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f6cea87e2e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f6cea883cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f6cdf499ba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f6ce068017a]",
        "clone()"
    ],
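
For readers decoding the trace: the failed assertion in librbd::ImageWatcher<>::schedule_request_lock() requires that the image still has an exclusive-lock object and that this client is not already the lock owner when a lock request is queued. A rough Python rendering of that invariant, with hypothetical names (the authoritative code is the C++ above):

    def schedule_request_lock(image_ctx, use_timer, timer_delay=0):
        # Mirrors the intent of the C++ ceph_assert: both conditions must
        # hold before a lock request is scheduled. The crash means one was
        # violated, e.g. the watcher requested a lock it already held.
        assert image_ctx.exclusive_lock is not None, "exclusive lock gone"
        assert not image_ctx.exclusive_lock.is_lock_owner(), "already owner"
        ...  # queue the (possibly delayed) lock request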

Comment 8 Scott Ostapovicz 2022-05-06 20:28:46 UTC
Done

Comment 15 Gopi 2022-07-01 04:33:29 UTC
Working as expected with the latest build, hence moving to the verified state.

Comment 20 errata-xmlrpc 2022-08-09 17:37:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997