Created attachment 1332429 [details]
failed MDS log

Description of problem:
MDS asserted while rejoining the cluster.

2017-09-29 14:22:08.372504 7fd43ac3a700 0 -- 10.8.128.61:6800/2359662045 >> 10.8.128.58:6800/3613195149 conn(0x55e9fd8ed000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-09-29 14:22:08.406067 7fd4384c0700 1 mds.1.10181 resolve_done
2017-09-29 14:22:08.409016 7fd4384c0700 -1 /builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7fd4384c0700 time 2017-09-29 14:22:08.406215
/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
11: (()+0x7e25) [0x7fd43d712e25]
12: (clone()+0x6d) [0x7fd43c7f534d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Start with the cluster in a degraded state.
2. Remove one OSD node (3 OSD daemons) from the cluster and change the pool replication from size 3 to size 2 (size 2, min_size 1).
3. While degraded I/O is in progress, fail over the MDS.

Actual results:
MDS asserted in MDCache::handle_cache_rejoin_weak() during the failover (see the log above).

Expected results:
MDS failover completes without the MDS asserting.

Additional info:
NA
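To make the failure mode concrete, below is a minimal self-contained toy model of the check that fires at MDCache.cc:4332. This is illustration only, not Ceph source; ToyMDS, MDSState, and main() are invented names for the sketch. The weak-rejoin handler assumes the local rank has already entered rejoin and aborts when a peer's rejoin message arrives while the rank is still in resolve, which matches the resolve_done line immediately before the assert in the log above.

#include <cassert>
#include <iostream>

// Toy model of the MDS state machine around the failed assert.
// The state names mirror the MDS states involved; everything else
// is invented for illustration and is not Ceph's actual API.
enum class MDSState { RESOLVE, REJOIN };

struct ToyMDS {
  MDSState state = MDSState::RESOLVE;
  bool is_rejoin() const { return state == MDSState::REJOIN; }
};

// Analogue of MDCache::handle_cache_rejoin_weak(): it expects the local
// rank to already be in rejoin before processing a peer's weak rejoin.
void handle_cache_rejoin_weak(ToyMDS &mds) {
  assert(mds.is_rejoin());  // the equivalent of MDCache.cc:4332
  std::cout << "processing weak rejoin\n";
}

int main() {
  ToyMDS mds;                     // still in RESOLVE (resolve_done just ran)
  handle_cache_rejoin_weak(mds);  // peer message arrives too early -> abort
  return 0;
}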
Probably the MDS state is still resolve while its wanted_state is rejoin.
Ignore my previous comment. Please upload the log of mds.0.
Created attachment 1334036 [details]
mds.0 node logs

Attached the mds.0 node logs.
Yes. The MDS received an MMDSCacheRejoin message while in an unexpected state, so it killed itself. The unexpected message was sent from 10.8.128.58 (magna058), but it seems that node's log has been deleted.
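For reference, a common defensive pattern for this kind of ordering race is to defer messages that arrive before the rank reaches the state they require and replay them once the transition happens. The sketch below is a hypothetical toy in the same invented ToyMDS style as above, not the actual upstream fix:

#include <iostream>
#include <queue>

enum class MDSState { RESOLVE, REJOIN };

// Hypothetical illustration only: queue messages that arrive too early
// and replay them on the state transition instead of asserting.
struct ToyMDS {
  MDSState state = MDSState::RESOLVE;
  std::queue<int> waiting_rejoins;  // stand-in for queued MMDSCacheRejoin*

  void handle_rejoin(int msg) {
    if (state != MDSState::REJOIN) {
      waiting_rejoins.push(msg);    // defer instead of killing the daemon
      return;
    }
    std::cout << "processing rejoin " << msg << "\n";
  }

  void enter_rejoin() {
    state = MDSState::REJOIN;
    while (!waiting_rejoins.empty()) {  // replay deferred messages in order
      int msg = waiting_rejoins.front();
      waiting_rejoins.pop();
      handle_rejoin(msg);
    }
  }
};

int main() {
  ToyMDS mds;
  mds.handle_rejoin(1);  // arrives during resolve: deferred, no abort
  mds.enter_rejoin();    // the transition replays the queued message
  return 0;
}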
Tried multiple times but was unable to reproduce the bug. For now, moving the bug to the closed state as WORKSFORME. Will re-open the bug if the MDS assert is reproduced.