Created attachment 1332429 [details]
failed MDS log

Description of problem:
MDS asserted while rejoining the cluster.

2017-09-29 14:22:08.372504 7fd43ac3a700 0 -- 10.8.128.61:6800/2359662045 >> 10.8.128.58:6800/3613195149 conn(0x55e9fd8ed000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-09-29 14:22:08.406067 7fd4384c0700 1 mds.1.10181 resolve_done
2017-09-29 14:22:08.409016 7fd4384c0700 -1 /builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7fd4384c0700 time 2017-09-29 14:22:08.406215
/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
11: (()+0x7e25) [0x7fd43d712e25]
12: (clone()+0x6d) [0x7fd43c7f534d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Start with the cluster in a degraded state.
2. Remove one OSD node (3 OSD daemons) from the cluster and change the pool replication from size 3 to size 2 (size 2, min_size 1).
3. While degraded I/O is in progress, fail over the MDS.

Actual results:
MDS asserted in MDCache::handle_cache_rejoin_weak() during the failover (see the log above).

Expected results:
MDS failover completes without the MDS asserting.

Additional info:
NA
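To make the failure mode concrete, below is a minimal self-contained toy model of the check that fires at MDCache.cc:4332. This is illustration only, not Ceph source; ToyMDS, MDSState, and main() are invented names for the sketch. The weak-rejoin handler assumes the local rank has already entered rejoin and aborts when a peer's rejoin message arrives while the rank is still in resolve, which matches the resolve_done line immediately before the assert in the log above.

#include <cassert>
#include <iostream>

// Toy model of the MDS state machine around the failed assert.
// The state names mirror the MDS states involved; everything else
// is invented for illustration and is not Ceph's actual API.
enum class MDSState { RESOLVE, REJOIN };

struct ToyMDS {
  MDSState state = MDSState::RESOLVE;
  bool is_rejoin() const { return state == MDSState::REJOIN; }
};

// Analogue of MDCache::handle_cache_rejoin_weak(): it expects the local
// rank to already be in rejoin before processing a peer's weak rejoin.
void handle_cache_rejoin_weak(ToyMDS &mds) {
  assert(mds.is_rejoin());  // the equivalent of MDCache.cc:4332
  std::cout << "processing weak rejoin\n";
}

int main() {
  ToyMDS mds;                     // still in RESOLVE (resolve_done just ran)
  handle_cache_rejoin_weak(mds);  // peer message arrives too early -> abort
  return 0;
}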
Probably the MDS state is still resolve while its wanted_state is rejoin.
Ignore my previous comment. Please upload the log of mds.0.
Created attachment 1334036 [details]
mds.0 node logs

Attached the mds.0 node logs.
Yes. The MDS received an MMDSCacheRejoin message while in an unexpected state, so it killed itself. The unexpected message was sent from 10.8.128.58 (magna058), but it seems that node's log has been deleted.
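For reference, a common defensive pattern for this kind of ordering race is to defer messages that arrive before the rank reaches the state they require and replay them once the transition happens. The sketch below is a hypothetical toy in the same invented ToyMDS style as above, not the actual upstream fix:

#include <iostream>
#include <queue>

enum class MDSState { RESOLVE, REJOIN };

// Hypothetical illustration only: queue messages that arrive too early
// and replay them on the state transition instead of asserting.
struct ToyMDS {
  MDSState state = MDSState::RESOLVE;
  std::queue<int> waiting_rejoins;  // stand-in for queued MMDSCacheRejoin*

  void handle_rejoin(int msg) {
    if (state != MDSState::REJOIN) {
      waiting_rejoins.push(msg);    // defer instead of killing the daemon
      return;
    }
    std::cout << "processing rejoin " << msg << "\n";
  }

  void enter_rejoin() {
    state = MDSState::REJOIN;
    while (!waiting_rejoins.empty()) {  // replay deferred messages in order
      int msg = waiting_rejoins.front();
      waiting_rejoins.pop();
      handle_rejoin(msg);
    }
  }
};

int main() {
  ToyMDS mds;
  mds.handle_rejoin(1);  // arrives during resolve: deferred, no abort
  mds.enter_rejoin();    // the transition replays the queued message
  return 0;
}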
Tried multiple times but was unable to reproduce the bug. For now, moving the bug to the closed state as WORKSFORME. Will re-open the bug if the MDS assert is reproduced.