Bug 1497243

Summary: [CephFS] During mds rejoin, it got asserted in degraded cluster
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ramakrishnan Periyasamy <rperiyas>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED WORKSFORME
QA Contact: Ramakrishnan Periyasamy <rperiyas>
Severity: high
Priority: high
Version: 3.0
CC: bniver, ceph-eng-bugs, hnallurv, john.spray, rperiyas, zyan
Target Milestone: rc
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-10-24 03:51:10 UTC
Type: Bug
Attachments:
- failed MDS log
- mds.0 node logs

Description Ramakrishnan Periyasamy 2017-09-29 14:41:13 UTC
Created attachment 1332429 [details]
failed MDS log

Description of problem:
The MDS hit an assertion failure while rejoining the cluster.

2017-09-29 14:22:08.372504 7fd43ac3a700  0 -- 10.8.128.61:6800/2359662045 >> 10.8.128.58:6800/3613195149 conn(0x55e9fd8ed000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_CONNECT_REPLY
2017-09-29 14:22:08.406067 7fd4384c0700  1 mds.1.10181 resolve_done
2017-09-29 14:22:08.409016 7fd4384c0700 -1 /builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)' thread 7fd4384c0700 time 2017-09-29 14:22:08.406215
/builddir/build/BUILD/ceph-12.2.1/src/mds/MDCache.cc: 4332: FAILED assert(mds->is_rejoin())

 ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55e9f4316390]
 2: (MDCache::handle_cache_rejoin_weak(MMDSCacheRejoin*)+0x16cf) [0x55e9f410ae9f]
 3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x24b) [0x55e9f410f75b]
 4: (MDCache::dispatch(Message*)+0xa5) [0x55e9f4114d85]
 5: (MDSRank::handle_deferrable_message(Message*)+0x5c4) [0x55e9f3ffd624]
 6: (MDSRank::_dispatch(Message*, bool)+0x1e3) [0x55e9f400af13]
 7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55e9f400bd55]
 8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55e9f3ff4f33]
 9: (DispatchQueue::entry()+0x792) [0x55e9f45f9c12]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55e9f439c5fd]
 11: (()+0x7e25) [0x7fd43d712e25]
 12: (clone()+0x6d) [0x7fd43c7f534d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-2.el7cp (965390e1785cd23ffde159014f25f9490e479668) luminous (stable)

How reproducible:
1/1

Steps to Reproduce:
1. With the cluster in a degraded state, remove one OSD node (3 OSD daemons) from the cluster.
2. Change the replication settings from 3 to 2 (size 2, min_size 1).
3. While degraded I/O is in progress, fail over the MDS.
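The steps above can be sketched as Ceph CLI commands; the pool name, OSD ids, and MDS rank below are placeholders, not values taken from this bug:

```shell
# Take the three OSD daemons of one node out of the cluster
# (osd ids 0-2 are placeholders).
ceph osd out 0 1 2
systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2

# Drop replication from 3 to 2 on the data pool
# ("cephfs_data" is a placeholder pool name).
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_data min_size 1

# While client I/O continues against the degraded cluster,
# fail over the active MDS (rank 0).
ceph mds fail 0
```

These commands require a live cluster, so they are a reproduction sketch rather than a runnable script.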

Actual results:
The MDS crashed with FAILED assert(mds->is_rejoin()) during rejoin.

Expected results:
The MDS should complete rejoin without asserting.

Additional info:
NA

Comment 2 Yan, Zheng 2017-10-01 02:02:55 UTC
Probably the MDS state is still resolve while its wanted_state is rejoin.

Comment 3 Yan, Zheng 2017-10-01 02:15:33 UTC
Ignore my previous comment. Please upload the log of mds.0.

Comment 4 Ramakrishnan Periyasamy 2017-10-04 04:56:45 UTC
Created attachment 1334036 [details]
mds.0 node logs.

Attached mds.0 node logs.

Comment 6 Yan, Zheng 2017-10-09 02:54:11 UTC
Yes. The MDS got an MMDSCacheRejoin message in an unexpected state, so it killed itself.

The unexpected message was sent from 10.8.128.58 (magna058), but it seems that log has been deleted.

Comment 14 Ramakrishnan Periyasamy 2017-10-24 03:51:10 UTC
Tried multiple times but was unable to reproduce the bug. For now, moving the bug to CLOSED WORKSFORME. Will reopen the bug if the MDS assert is reproduced.