Bug 1594760

Summary: mds, multimds: failed assertion in mds post failover
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 3.0
Target Milestone: z5
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Keywords: CodeChange
Reporter: Venky Shankar <vshankar>
Assignee: Patrick Donnelly <pdonnell>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, john.spray
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2018-06-29 23:58:06 UTC

Description Venky Shankar 2018-06-25 11:23:00 UTC
Description of problem:

In a multimds setup, after a failover of mds rank 0 the mds is unresponsive for a certain period, followed by the usual reconnection and eviction of unresponsive clients. At times there is a failed assertion in the now-active mds:

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f02ab91f850]
 2: (MDCache::request_get(metareqid_t)+0x267) [0x7f02ab6d7967]
 3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x314) [0x7f02ab68d6c4]
 4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x9ab) [0x7f02ab68edfb]
 5: (Server::dispatch(Message*)+0x633) [0x7f02ab68fad3]
 6: (MDSRank::handle_deferrable_message(Message*)+0x804) [0x7f02ab6068f4]
 7: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x7f02ab614573]
 8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x7f02ab6153b5]
 9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x7f02ab5fdff3]
 10: (DispatchQueue::entry()+0x792) [0x7f02abc03be2]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f02ab9a4fbd]
 12: (()+0x7e25) [0x7f02a93f9e25]
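
Frame 2 above is MDCache::request_get(), which the slave request reply path uses to look up the originating master request by its metareqid; the assertion fires when that id is not present in the mds's table of active requests. Below is a minimal sketch of that kind of lookup, assuming the usual active_requests map keyed by metareqid_t; it is an illustration of where the assert sits, not the verbatim Ceph source, and the type definitions are simplified stand-ins.

#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Illustrative, simplified stand-ins for the Ceph types involved.
struct metareqid_t {
  int64_t entity;   // requesting entity (client or mds)
  uint64_t tid;     // per-entity transaction id
  bool operator<(const metareqid_t &o) const {
    return entity < o.entity || (entity == o.entity && tid < o.tid);
  }
};

struct MDRequest {};
using MDRequestRef = std::shared_ptr<MDRequest>;

struct MDCacheSketch {
  // Requests this mds currently knows about, keyed by metareqid.
  std::map<metareqid_t, MDRequestRef> active_requests;

  MDRequestRef request_get(metareqid_t rid) {
    auto p = active_requests.find(rid);
    // If a slave request reply names a master request that the newly active
    // mds does not have in this table (e.g. right after a rank 0 failover),
    // this assertion fails and the daemon aborts, as in the trace above.
    assert(p != active_requests.end());
    return p->second;
  }
};

That matches the trace, where Server::handle_slave_request_reply() calls into MDCache::request_get() on the mds that has just taken over the failed rank.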