Bug 2058223

Summary: [Tracker for Ceph BZ #2102227] [DR] rbd-mirror daemon crashed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub Component: RBD
Reporter: Pratik Surve <prsurve>
Assignee: Christopher Hoffman <choffman>
QA Contact: Pratik Surve <prsurve>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
CC: amagrawa, bniver, ekuric, idryomov, kramdoss, mmuench, muagarwa, ocs-bugs, odf-bz-bot, pnataraj, sagrawal, sostapov, srangana
Version: 4.10
Target Release: ODF 4.12.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.11.0-123
Doc Type: No Doc Update
Clones: 2102227 (view as bug list)
Last Closed: 2023-02-08 14:06:28 UTC
Type: Bug
Bug Depends On: 2102227

Description Pratik Surve 2022-02-24 14:28:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[DR] rbd-mirror daemon crashed


Version of all relevant components (if applicable):
bash-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-49.el8cp (70af8286930223d22a515234e4724928641877bf) pacific (stable)": 11
    }
}
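
For reference, the output above comes from running "ceph versions" inside the cluster. A minimal sketch of how to reach a shell where that command works, assuming a default ODF deployment in the openshift-storage namespace with the standard rook-ceph-tools pod (the namespace and label selector are assumptions, not taken from this report):

# Locate the toolbox pod (namespace and label are assumptions based on
# a default ODF/rook-ceph install).
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)

# Open a shell in it, then dump the versions of all running Ceph daemons.
oc rsh -n openshift-storage "$TOOLS_POD"
ceph versions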

ODF version:- full_version=4.10.0-156


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a DR cluster.
2. Run workloads with around 200+ pods and PVCs (a rough sketch follows).
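
A rough illustration of step 2. The namespace, PVC naming, and storage class below are hypothetical placeholders (ocs-storagecluster-ceph-rbd is the usual ODF default RBD storage class, but the exact workload used here is not recorded in this report):

# Create ~200 RBD-backed PVCs; pods consuming them would be created similarly.
for i in $(seq 1 200); do
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dr-test-pvc-$i
  namespace: dr-workload
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF
done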



Actual results:

bash-4.4$ ceph crash info 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd
{
    "assert_condition": "m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner()",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc",
    "assert_func": "void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]",
    "assert_line": 580,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: In function 'void librbd::ImageWatcher<ImageCtxT>::schedule_request_lock(bool, int) [with ImageCtxT = librbd::ImageCtx]' thread 7f1cfce85700 time 2022-02-17T00:30:51.636845+0000\n/builddir/build/BUILD/ceph-16.2.7/src/librbd/ImageWatcher.cc: 580: FAILED ceph_assert(m_image_ctx.exclusive_lock && !m_image_ctx.exclusive_lock->is_lock_owner())\n",
    "assert_thread_name": "io_context_pool",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f1d113ecc20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1d11facd43]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276f0c) [0x7f1d11facf0c]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::schedule_request_lock(bool, int)+0x3b6) [0x565476e95fe6]",
        "(librbd::ImageWatcher<librbd::ImageCtx>::handle_request_lock(int)+0x486) [0x565476e96536]",
        "(librbd::image_watcher::NotifyLockOwner::finish(int)+0x2b) [0x56547701aa5b]",
        "(librbd::image_watcher::NotifyLockOwner::handle_notify(int)+0x9e4) [0x56547701b814]",
        "(Context::complete(int)+0xd) [0x565476cec80d]",
        "(boost::asio::detail::completion_handler<boost::asio::detail::work_dispatcher<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x565476cecca6]",
        "(boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x85) [0x565476e5fea5]",
        "/lib64/librados.so.2(+0xc12e2) [0x7f1d1b5e02e2]",
        "/lib64/librados.so.2(+0xc6cea) [0x7f1d1b5e5cea]",
        "/lib64/libstdc++.so.6(+0xc2ba3) [0x7f1d101fbba3]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f1d113e217a]",
        "clone()"
    ],
    "ceph_version": "16.2.7-49.el8cp",
    "crash_id": "2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd",
    "entity_name": "client.rbd-mirror.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.5 (Ootpa)",
    "os_version_id": "8.5",
    "process_name": "rbd-mirror",
    "stack_sig": "dac8ee138a54782e0e75dd65d50f017b0131db6c412d5724c38c666acc2ff9b5",
    "timestamp": "2022-02-17T00:30:51.642803Z",
    "utsname_hostname": "rook-ceph-rbd-mirror-a-fb7fcb66f-j48ql",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.34.2.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Jan 17 09:42:23 EST 2022"
}
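
Reading the assertion: schedule_request_lock() asserts that the image still has an exclusive-lock object and that this client is not already the lock owner, so the crash means the rbd-mirror daemon reached that path in a state the code rules out. The crash record itself was pulled with the standard ceph crash tooling; from the same toolbox shell:

# List all recorded crashes, then fetch full details for one.
ceph crash ls
ceph crash info 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd

# After triage, archive the crash so it no longer flags cluster health.
ceph crash archive 2022-02-17T00:30:51.642803Z_85a947d4-62ba-45a7-8219-2f453c4036fd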

Expected results:
The rbd-mirror daemon should not crash.

Additional info:

Comment 2 Mudit Agarwal 2022-03-08 13:24:00 UTC
Ilya, please take a look.

Comment 21 Mudit Agarwal 2022-04-05 13:46:06 UTC
Moving DR BZs to 4.10.z/4.11

Comment 26 Scott Ostapovicz 2022-04-14 15:50:07 UTC
The net conclusion on the blocker status was that this might be a blocker for GA, but it is not a blocker for a tech preview.

Comment 29 Mudit Agarwal 2022-06-29 13:35:35 UTC
Ilya, do we have plans to fix this issue in Ceph 5.2?
If yes, can we please create a Ceph BZ?

Comment 31 Mudit Agarwal 2022-06-29 13:55:51 UTC
Thanks, moving it to 4.12 then.

Comment 53 Red Hat Bugzilla 2023-12-08 04:27:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days