Bug 2058223 - [Tracker for Ceph BZ #2102227] [DR] rbd-mirror daemon crashed
Summary: [Tracker for Ceph BZ #2102227] [DR] rbd-mirror daemon crashed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Christopher Hoffman
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 2102227
Blocks:
 
Reported: 2022-02-24 14:28 UTC by Pratik Surve
Modified: 2023-12-08 04:27 UTC
CC List: 13 users

Fixed In Version: 4.11.0-123
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2102227
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:



Description Pratik Surve 2022-02-24 14:28:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[DR] rbd-mirror daemon crashed


Version of all relevant components (if applicable):
bash-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 11
    }
}

ODF version: full_version=4.10.0-156


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a DR cluster.
2. Run workloads with around 200+ pods and PVCs (a sketch for confirming the resulting crash follows below).
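
The crash can then be confirmed from the managed cluster running the rbd-mirror daemon. A minimal sketch, assuming the default ODF namespace (openshift-storage) and the standard Rook toolbox deployment name (rook-ceph-tools); neither name is stated in this report:

# Check the rbd-mirror pod for restarts, then inspect recorded crashes from the toolbox
oc -n openshift-storage get pods | grep rbd-mirror
oc -n openshift-storage rsh deploy/rook-ceph-tools
ceph crash ls
ceph crash info <crash_id>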



Actual results:

bash-4.4$ ceph crash info 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd
{
    "assert_condition": "m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc",
    "assert_func": "void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]",
    "assert_line": 580,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f1cfce85700 time 2022-02-17T00:30:51.636845+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f1d113ecc20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1d11facd43]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f0c) [0x7f1d11facf0c]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x565476e95fe6]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x565476e96536]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x56547701aa5b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x56547701b814]",
        "(Context::complete(int)+0xd) [0x565476cec80d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x565476cecca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x565476e5fea5]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f1d1b5e02e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f1d1b5e5cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f1d101fbba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f1d113e217a]",
        "clone()"
    ],
    "ceph_version": "16.2.7-49.el8cp",
    "crash_id": "2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd",
    "entity_name": "client.rbd-mirror.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.5 (Ootpa)",
    "os_version_id": "8.5",
    "process_name": "rbd-mirror",
    "stack_sig": "dac8ee138a54782e0e75dd65d50f017b0131db6c412d5724c38c666acc2ff9b5",
    "timestamp": "2022-02-17T00:30:51.642803Z",
    "utsname_hostname": "rook-ceph-rbd-mirror-a-fb7fcb66f-j48ql",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.34.2.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Jan 17 09:42:23 EST 2022"
}

Expected results:
There should not be any crash

Additional info:
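For further triage, the log of the crashed container can be pulled and the recorded crash archived once captured. A minimal sketch, assuming the default openshift-storage namespace and the rbd-mirror pod name implied by the crash's utsname_hostname above; both are assumptions, not values taken from this report:

# Find the rbd-mirror pod and fetch the log of the container before it restarted
oc -n openshift-storage get pods | grep rbd-mirror
oc -n openshift-storage logs <rbd-mirror-pod> --previous
# Inside the toolbox: silence the crash health warning once it has been triaged
ceph crash archive 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd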

Comment 2 Mudit Agarwal 2022-03-08 13:24:00 UTC
Ilya, please take a look.

Comment 21 Mudit Agarwal 2022-04-05 13:46:06 UTC
Moving DR BZs to 4.10.z/4.11

Comment 26 Scott Ostapovicz 2022-04-14 15:50:07 UTC
The net conclusion on the blocker status was that this might be a blocker for GA, but it is not a blocker for a tech preview.

Comment 29 Mudit Agarwal 2022-06-29 13:35:35 UTC
Ilya, do we have plans to fix this issue in Ceph 5.2?
If yes, can we please create a ceph bz?

Comment 31 Mudit Agarwal 2022-06-29 13:55:51 UTC
Thanks, moving it to 4.12 then.

Comment 53 Red Hat Bugzilla 2023-12-08 04:27:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

