Description of problem (please be as detailed as possible and provide log snippets):

[DR] rbd-mirror daemon crashed

Version of all relevant components (if applicable):

bash-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 11
    }
}

ODF version: full_version=4.10.0-156

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a DR cluster
2. Run workloads with around 200+ pods and PVCs (a sketch that approximates this workload is included under Additional info below)

Actual results:

bash-4.4$ ceph crash info 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd
{
    "assert_condition": "m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc",
    "assert_func": "void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]",
    "assert_line": 580,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f1cfce85700 time 2022-02-17T00:30:51.636845+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f1d113ecc20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1d11facd43]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f0c) [0x7f1d11facf0c]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x565476e95fe6]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x565476e96536]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x56547701aa5b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x56547701b814]",
        "(Context::complete(int)+0xd) [0x565476cec80d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x565476cecca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x565476e5fea5]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f1d1b5e02e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f1d1b5e5cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f1d101fbba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f1d113e217a]",
        "clone()"
    ],
    "ceph_version": "16.2.7-49.el8cp",
    "crash_id": "2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd",
    "entity_name": "client.rbd-mirror.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.5 (Ootpa)",
    "os_version_id": "8.5",
    "process_name": "rbd-mirror",
    "stack_sig": "dac8ee138a54782e0e75dd65d50f017b0131db6c412d5724c38c666acc2ff9b5",
    "timestamp": "2022-02-17T00:30:51.642803Z",
    "utsname_hostname": "rook-ceph-rbd-mirror-a-fb7fcb66f-j48ql",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.34.2.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Jan 17 09:42:23 EST 2022"
}

Expected results:

There should not be any crash.

Additional info:
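As a rough illustration of the scale in step 2 above, the following hypothetical bash loop creates 200 RBD-backed PVCs. The namespace, the storage class name (ocs-storagecluster-ceph-rbd), and the PVC size are assumptions for illustration only, not the exact workload used in the test; in a real Regional-DR run the PVCs would belong to applications protected by a DR policy, so this only approximates the volume count:

# Hypothetical workload generator: approximates the 200+ PVC scale only.
NS=dr-workload                     # assumed namespace
SC=ocs-storagecluster-ceph-rbd     # assumed default ODF RBD storage class

oc new-project "$NS" 2>/dev/null || oc project "$NS"

for i in $(seq 1 200); do
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dr-test-pvc-$i
  namespace: $NS
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: $SC
  resources:
    requests:
      storage: 1Gi
EOF
done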
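For triage, here is a minimal sketch of how such a crash can be inspected and acknowledged from the toolbox pod. The openshift-storage namespace and the app=rook-ceph-rbd-mirror pod label are assumptions based on a standard Rook/ODF deployment; adjust for the actual cluster:

# List crashes the cluster has collected but not yet acknowledged
ceph crash ls-new

# Dump the full metadata (assert, backtrace, daemon identity) for one crash
ceph crash info 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd

# Verify the rbd-mirror pod restarted after the abort
oc -n openshift-storage get pods -l app=rook-ceph-rbd-mirror

# Once recorded, archive the crash so the RECENT_CRASH health warning clears
ceph crash archive 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd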
"/lib64/librados.so.2(+0xc12e2) [0x7f1d1b5e02e2]", "/lib64/librados.so.2(+0xc6cea) [0x7f1d1b5e5cea]", "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f1d101fbba3]", "/lib64/libpthread.so.0(+0x817a) [0x7f1d113e217a]", "clone()" ], "ceph_version": "16.2.7-49.el8cp", "crash_id": "2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd", "entity_name": "client.rbd-mirror.a", "os_id": "rhel", "os_name": "Red Hat Enterprise Linux", "os_version": "8.5 (Ootpa)", "os_version_id": "8.5", "process_name": "rbd-mirror", "stack_sig": "dac8ee138a54782e0e75dd65d50f017b0131db6c412d5724c38c666acc2ff9b5", "timestamp": "2022-02-17T00:30:51.642803Z", "utsname_hostname": "rook-ceph-rbd-mirror-a-fb7fcb66f-j48ql", "utsname_machine": "x86_64", "utsname_release": "4.18.0-305.34.2.el8_4.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP Mon Jan 17 09:42:23 EST 2022" } Expected results: There should not be any crash Additional info:
Ilya, please take a look.
Moving DR BZs to 4.10.z/4.11
The net conclusion on the blocker status was that this might be a blocker for GA, but it is not a blocker for a tech preview.
Ilya, do we have plans to fix this issue in Ceph 5.2? If so, can we please create a Ceph BZ?
Thanks, moving it to 4.12 then.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.