Bug 2228251

Summary: [GSS][CephFS] MDS crashing with ceph_assert(lock->get_state() == LOCK_LOCK || lock->get_state() == LOCK_MIX || lock->get_state() == LOCK_MIX_SYNC2)
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Anton Mark <amark>
Component: CephFS Assignee: Xiubo Li <xiubli>
Status: ASSIGNED --- QA Contact: Hemanth Kumar <hyelloji>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.0 CC: bkunal, ceph-eng-bugs, cephqe-warriors, vshankar, xiubli
Target Milestone: --- Flags: xiubli: needinfo? (amark)
Target Release: 7.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Anton Mark 2023-08-01 20:35:34 UTC
Description of problem:
Slow/blocked ops led to MDS failures, including an MDS daemon crash. After MDS daemons on multiple ranks failed, the cluster returned to normal operation.


Version-Release number of selected component (if applicable):
{
    "mon": {
        "ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)": 5
    },
    "mgr": {
        "ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)": 312
    },
    "mds": {
        "ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)": 8
    },
    "overall": {
        "ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)": 327
    }
}
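
The listing above has the shape of `ceph versions` output. As a quick sanity check, the following minimal Python sketch confirms that every daemon type reports the same build and that the per-type counts add up to the "overall" figure. The `summarize` helper and the `versions` dictionary are illustrative only; the counts and version string are copied from the listing, and nothing here is part of any Ceph tooling.

```python
# Version string and counts copied from the listing above.
VERSION = (
    "ceph version 17.2.6-70.el9cp "
    "(fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)"
)

versions = {
    "mon": {VERSION: 5},
    "mgr": {VERSION: 2},
    "osd": {VERSION: 312},
    "mds": {VERSION: 8},
    "overall": {VERSION: 327},
}


def summarize(versions):
    """Return (distinct version strings, total daemon count).

    Hypothetical helper: only the per-daemon-type sections are counted,
    so the "overall" section can be compared against the result.
    """
    per_type = {k: v for k, v in versions.items() if k != "overall"}
    distinct = {ver for counts in per_type.values() for ver in counts}
    total = sum(n for counts in per_type.values() for n in counts.values())
    return distinct, total


distinct, total = summarize(versions)
print(f"distinct versions: {len(distinct)}, daemons: {total}")
```

With the numbers above, a single distinct version and a total matching "overall" suggest that the cluster is not running mixed builds, so a partially upgraded cluster is unlikely to be a factor in the crash.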

How reproducible:
Unknown.

Steps to Reproduce:
Unknown.

Actual results:
Client IO interruption and MDS daemon crash.

Expected results:
No crash or IO interruption for clients.

Additional info:

{
    "assert_condition": "lock->get_state() == LOCK_LOCK || lock->get_state() == LOCK_MIX || lock->get_state() == LOCK_MIX_SYNC2",
    "assert_file": "/builddir/build/BUILD/ceph-17.2.6/src/mds/Locker.cc",
    "assert_func": "void Locker::handle_file_lock(ScatterLock*, ceph::cref_t<MLock>&)",
    "assert_line": 5767,
    "assert_msg": "/builddir/build/BUILD/ceph-17.2.6/src/mds/Locker.cc: In function 'void Locker::handle_file_lock(ScatterLock*, ceph::cref_t<MLock>&)' thread 7f21a0425640 time 2023-07-31T19:16:21.053778+0000\n/builddir/build/BUILD/ceph-17.2.6/src/mds/Locker.cc: 5767: FAILED ceph_assert(lock->get_state() == LOCK_LOCK || lock->get_state() == LOCK_MIX || lock->get_state() == LOCK_MIX_SYNC2)\n",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f21a52c0df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f21a530d54c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f21a591dae1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x143c45) [0x7f21a591dc45]",
        "(Locker::handle_file_lock(ScatterLock*, boost::intrusive_ptr<MLock const> const&)+0x14ac) [0x5571d93300ac]",
        "(Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0x164) [0x5571d9304d34]",
        "(MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x4f3) [0x5571d919a443]",
        "(MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x5571d919a99c]",
        "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x195) [0x5571d9184815]",
        "(DispatchQueue::entry()+0x53a) [0x7f21a5b0975a]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3bb6b1) [0x7f21a5b956b1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f21a530b802]",
        "clone()"
    ],
    "ceph_version": "17.2.6-70.el9cp",
    "crash_id": "2023-07-31T19:16:21.079357Z_5e8c4120-46ce-4e59-9a0f-3bd35449dfcb",
    "entity_name": "mds.root.host10.fckajv",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mds",
    "stack_sig": "62a9400fe4f62ee81c69ab8eb327515c3a3ab2affea4ab8fd8f190aad70bfeba",
    "timestamp": "2023-07-31T19:16:21.079357Z",
    "utsname_hostname": "host10",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-477.13.1.el8_8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu May 18 10:27:05 EDT 2023"