Bug 1497243

Summary: [CephFS] During mds rejoin, it got asserted in degraded cluster
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ramakrishnan Periyasamy <rperiyas>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED WORKSFORME
QA Contact: Ramakrishnan Periyasamy <rperiyas>
Severity: high
Priority: high
Version: 3.0
CC: bniver, ceph-eng-bugs, hnallurv, john.spray, rperiyas, zyan
Target Milestone: rc
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-10-24 03:51:10 UTC
Type: Bug
Attachments:
- failed MDS log
- mds.0 node logs

Description Ramakrishnan Periyasamy 2017-09-29 14:41:13 UTC
Created attachment 1332429 [details]
failed MDS log

Description of problem:
The MDS hit an assertion failure while rejoining the cluster.

2017-09-29 14:22:08.372504 7fd43ac3a700  0 -- 10.8.128.61:6800/2359662045 >> 10.8.128.58:6800/3613195149 conn(0x55e9fd8ed000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-09-29 14:22:08.406067 7fd4384c0700  1 mds.1.10181 resolve_done
2017-09-29 14:22:08.409016 7fd4384c0700 -1 /builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7fd4384c0700 time 2017-09-29 14:22:08.406215
/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

How reproducible:
1/1

Steps to Reproduce:
1. With the cluster in a degraded state, remove one OSD node (3 OSD daemons) from the cluster.
2. Change the replication settings from 3 to 2 (size 2, min_size 1).
3. While degraded I/O is in progress, fail over the MDS.
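The steps above can be sketched as Ceph CLI commands; the pool name, OSD ids, and MDS rank below are placeholders, not values taken from this bug:

```shell
# Take the three OSD daemons of one node out of the cluster
# (osd ids 0-2 are placeholders).
ceph osd out 0 1 2
systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2

# Drop replication from 3 to 2 on the data pool
# ("cephfs_data" is a placeholder pool name).
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_data min_size 1

# While client I/O continues against the degraded cluster,
# fail over the active MDS (rank 0).
ceph mds fail 0
```

These commands require a live cluster, so they are a reproduction sketch rather than a runnable script.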

Actual results:
The MDS crashed with FAILED assert(mds->is_rejoin()) during rejoin.

Expected results:
The MDS should complete rejoin without asserting.

Additional info:
NA

Comment 2 Yan, Zheng 2017-10-01 02:02:55 UTC
Probably the MDS state is still resolve while its wanted_state is rejoin.

Comment 3 Yan, Zheng 2017-10-01 02:15:33 UTC
Ignore my previous comment. Please upload the log of mds.0.

Comment 4 Ramakrishnan Periyasamy 2017-10-04 04:56:45 UTC
Created attachment 1334036 [details]
mds.0 node logs.

Attached mds.0 node logs.

Comment 6 Yan, Zheng 2017-10-09 02:54:11 UTC
Yes. The MDS got an MMDSCacheRejoin message in an unexpected state, so it killed itself.

The unexpected message was sent from 10.8.128.58 (magna058), but it seems that log has been deleted.

Comment 14 Ramakrishnan Periyasamy 2017-10-24 03:51:10 UTC
Tried multiple times but was unable to reproduce the bug. For now, moving the bug to CLOSED WORKSFORME. Will reopen the bug if the MDS assert is reproduced.