Bug 1497243 - [CephFS] During mds rejoin, it got asserted in degraded cluster
Summary: [CephFS] During mds rejoin, it got asserted in degraded cluster
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.1
Assignee: Patrick Donnelly
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-29 14:41 UTC by Ramakrishnan Periyasamy
Modified: 2017-10-24 03:51 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-24 03:51:10 UTC
Embargoed:


Attachments
failed MDS log (14.06 MB, text/plain) - 2017-09-29 14:41 UTC, Ramakrishnan Periyasamy
mds.0 node logs. (51.68 KB, application/x-gzip) - 2017-10-04 04:56 UTC, Ramakrishnan Periyasamy


Links
Ceph Project Bug Tracker 21777 (last updated 2017-10-12 22:49:10 UTC)

Description Ramakrishnan Periyasamy 2017-09-29 14:41:13 UTC
Created attachment 1332429 [details]
failed MDS log

Description of problem:
The MDS hit an assertion failure while rejoining the cluster.

2017-09-29 14:22:08.372504 7fd43ac3a700  0 -- 10.8.128.61:6800/2359662045 >> 10.8.128.58:6800/3613195149 conn(0x55e9fd8ed000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-09-29 14:22:08.406067 7fd4384c0700  1 mds.1.10181 resolve_done
2017-09-29 14:22:08.409016 7fd4384c0700 -1 /builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7fd4384c0700 time 2017-09-29 14:22:08.406215
/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
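
For context, the backtrace points at a guard in the weak-rejoin handler that requires the local rank to already be in the rejoin state when the message is processed. Below is a minimal standalone sketch of that guard pattern; the names (FakeRank, WeakRejoinMsg, MdsState) are simplified stand-ins, not the actual Ceph source. It only illustrates the failure mode: if the message is handled before the rank has entered rejoin, the assert fires and the daemon aborts, matching the log above.

// Minimal standalone sketch (NOT the actual Ceph source) of the state guard
// behind "FAILED assert(mds->is_rejoin())". Names are simplified stand-ins
// for MDSRank / MMDSCacheRejoin.
#include <cassert>
#include <iostream>

enum class MdsState { Resolve, Rejoin, Active };

struct WeakRejoinMsg {
    int sender_rank;  // peer MDS rank that sent the weak rejoin
};

class FakeRank {
public:
    explicit FakeRank(MdsState s) : state_(s) {}
    bool is_rejoin() const { return state_ == MdsState::Rejoin; }

    // Analogue of MDCache::handle_cache_rejoin_weak(): the handler assumes the
    // rank has already moved past resolve into rejoin. If a peer's weak rejoin
    // arrives earlier, the assert aborts the daemon.
    void handle_cache_rejoin_weak(const WeakRejoinMsg& m) {
        assert(is_rejoin());  // corresponds to FAILED assert(mds->is_rejoin())
        std::cout << "processing weak rejoin from mds." << m.sender_rank << "\n";
    }

private:
    MdsState state_;
};

int main() {
    FakeRank still_resolving(MdsState::Resolve);
    // Receiving the message before reaching rejoin trips the assert and
    // terminates the process, mirroring the MDS crash seen in the log.
    still_resolving.handle_cache_rejoin_weak(WeakRejoinMsg{0});
    return 0;
}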

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

How reproducible:
1/1

Steps to Reproduce:
The cluster was in a degraded state: one OSD node (3 OSD daemons) was removed from the cluster and the replication was changed from 3 to 2 (size 2, min_size 1). While I/O was running against the degraded cluster, an MDS failover was performed.

Actual results:
NA

Expected results:
NA

Additional info:
NA

Comment 2 Yan, Zheng 2017-10-01 02:02:55 UTC
Probably the MDS state is still resolve while the wanted_state is rejoin.

Comment 3 Yan, Zheng 2017-10-01 02:15:33 UTC
Ignore my previous comment. Please upload the log of mds.0.

Comment 4 Ramakrishnan Periyasamy 2017-10-04 04:56:45 UTC
Created attachment 1334036 [details]
mds.0 node logs.

Attached mds.0 node logs.

Comment 6 Yan, Zheng 2017-10-09 02:54:11 UTC
Yes. The MDS got an MMDSCacheRejoin message in an unexpected state, so it killed itself.

The unexpected message was sent from 10.8.128.58 (magna058), but it seems that log has been deleted.
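
To make the timing concrete, here is a simplified two-rank sketch of the race this comment describes; the names are stand-ins, not Ceph code. The assumption illustrated is that the peer on 10.8.128.58 had already entered rejoin and broadcast its weak rejoin to all ranks while the crashing rank was still finishing resolve, so the message arrived in an unexpected state (in the real MDS the handler's assert then aborts the daemon, as in the backtrace above).

// Simplified two-rank timing sketch (stand-in names, not Ceph code) of a weak
// rejoin arriving while the receiver is still in resolve.
#include <iostream>
#include <vector>

enum class MdsState { Resolve, Rejoin };

const char* name(MdsState s) { return s == MdsState::Resolve ? "resolve" : "rejoin"; }

struct Msg { int from; };

struct Rank {
    int id;
    MdsState state;
    std::vector<Msg> inbox;

    // Entering rejoin immediately sends weak rejoin to every peer, whether or
    // not that peer has reached rejoin yet -- this is the race window.
    void enter_rejoin(std::vector<Rank*>& ranks) {
        state = MdsState::Rejoin;
        for (Rank* p : ranks)
            if (p->id != id) p->inbox.push_back(Msg{id});
    }

    void process_inbox() {
        for (const Msg& m : inbox) {
            if (state != MdsState::Rejoin) {
                std::cout << "mds." << id << " got weak rejoin from mds." << m.from
                          << " while still in " << name(state)
                          << " -> the real MDS hits assert(mds->is_rejoin()) here\n";
            } else {
                std::cout << "mds." << id << " processed weak rejoin from mds."
                          << m.from << "\n";
            }
        }
        inbox.clear();
    }
};

int main() {
    Rank mds0{0, MdsState::Resolve, {}};  // will enter rejoin first (sender side)
    Rank mds1{1, MdsState::Resolve, {}};  // still resolving when the message lands
    std::vector<Rank*> ranks{&mds0, &mds1};

    mds0.enter_rejoin(ranks);  // broadcasts weak rejoin to every other rank
    mds1.process_inbox();      // message arrives before mds1 reaches rejoin
    return 0;
}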

Comment 14 Ramakrishnan Periyasamy 2017-10-24 03:51:10 UTC
Tried multiple times but was unable to reproduce the bug. For now, moving the bug to the closed state as "WORKSFORME". Will re-open the bug if the MDS assert is reproduced.

