Bug 1289573 - OSD crash after re-weighting multiple OSDs repeatedly
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 1.3.3
Assigned To: Greg Farnum
QA Contact: ceph-qe-bugs
Depends On:
Blocks:
Reported: 2015-12-08 08:24 EST by Shruti Sampat
Modified: 2017-07-30 11:15 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-15 11:04:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Tracker: Ceph Project Bug Tracker 12805

Description Shruti Sampat 2015-12-08 08:24:07 EST
Description of problem:
-----------------------

All 8 OSDs in the cluster were reweighted one after the other, some of them more than once. After a while, all the PGs were active+clean but one of the OSDs was down. The following is from the OSD logs -

<snip> 

--- begin dump of recent events ---
    -2> 2015-12-08 17:02:35.017256 7fc065aa3700  5 osd.6 pg_epoch: 50238 pg[0.4b( v 50171'14011 (48628'4816,50171'14011] local-les=50173 n=23 ec=1 les/c 50173/49277 50172/50172/49957) [4,2,6,0]/[6,2,3,7] r=0 lpr=50172 pi=49275-50171/12 bft=0,4 crt=50171'14009 lcod 50171'14010 mlcod 0'0 active+remapped+backfill_toofull] exit Started/Primary/Active/NotBackfilling 10.000193 24 0.000302
    -1> 2015-12-08 17:02:35.017321 7fc065aa3700  5 osd.6 pg_epoch: 50238 pg[0.4b( v 50171'14011 (48628'4816,50171'14011] local-les=50173 n=23 ec=1 les/c 50173/49277 50172/50172/49957) [4,2,6,0]/[6,2,3,7] r=0 lpr=50172 pi=49275-50171/12 bft=0,4 crt=50171'14009 lcod 50171'14010 mlcod 0'0 active+remapped+backfill_toofull] enter Started/Primary/Active/WaitLocalBackfillReserved
     0> 2015-12-08 17:02:35.174654 7fc07a684700 -1 *** Caught signal (Aborted) **
 in thread 7fc07a684700

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-osd() [0xa20592]
 2: (()+0xf100) [0x7fc080a1b100]
 3: (gsignal()+0x37) [0x7fc07f4335f7]
 4: (abort()+0x148) [0x7fc07f434ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc07fd379d5]
 6: (()+0x5e946) [0x7fc07fd35946]
 7: (()+0x5e973) [0x7fc07fd35973]
 8: (()+0x5eb93) [0x7fc07fd35b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb210ca]
 10: (SimpleMessenger::reaper()+0xbd4) [0xaf86b4]
 11: (SimpleMessenger::reaper_entry()+0x150) [0xaf8830]
 12: (SimpleMessenger::ReaperThread::entry()+0xd) [0xb002ed]
 13: (()+0x7dc5) [0x7fc080a13dc5]
 14: (clone()+0x6d) [0x7fc07f4f421d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

</snip>
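
As the NOTE above says, the raw addresses are only meaningful against the exact ceph-osd binary that produced them. A rough sketch of that step (the binary path is taken from the trace; having the matching debuginfo installed, and the particular frame address picked here, are assumptions):

# disassemble the binary named in the trace; -S interleaves source only if
# the matching debuginfo package is installed
objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis

# or resolve a single frame, e.g. the SimpleMessenger::reaper() address above
addr2line -Cfe /usr/bin/ceph-osd 0xaf86b4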

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

How reproducible:
-----------------
Tried once.

Steps to Reproduce:
-------------------
1. In a cluster consisting of 6 OSDs across 3 nodes, 2 new OSDs on another node were added while `dd' ran on the mountpoint.
2. All 8 OSDs were reweighted one after the other, some of them more than once (see the command sketch below).
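
A minimal shell sketch of the above, for reference only - the mountpoint, the weight values, and the use of `ceph osd reweight' (rather than `ceph osd crush reweight') are assumptions, not details from the original run:

# background write load on the mounted filesystem (path and sizes assumed)
dd if=/dev/zero of=/mnt/ceph/ddtest bs=4M count=2048 &

# reweight all 8 OSDs one after the other (weight is a value in [0, 1])
for id in 0 1 2 3 4 5 6 7; do
    ceph osd reweight "$id" 0.8
done

# repeat for some of the OSDs, as in the report (values illustrative)
ceph osd reweight 6 1.0
ceph osd reweight 7 0.9

# watch until the PGs settle and check whether any OSD went down
ceph -s
ceph osd tree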

Actual results:
---------------
One OSD crashed.

Expected results:
-----------------
No crashes.
Comment 3 David Zafman 2015-12-09 16:11:10 EST
Upstream tracker 12805:  http://tracker.ceph.com/issues/12805
Comment 4 Greg Farnum 2015-12-11 17:50:57 EST
Unfortunately, there's no way we'll be able to diagnose this issue without messenger logging cranked up to 20, and possibly the core file. The only thing in this log is the op tracker output and a few message deliveries. If it's easily reproducible under this workload, great! Otherwise, it's the second report of this crash in many months and we've been through the code pretty carefully to look for missing cases.
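For reference, messenger logging can usually be raised on the running OSD without a restart, along these lines (osd.6 is taken from the log above; the exact invocation and whether to set it in ceph.conf instead are judgment calls):

# bump messenger debugging on the affected OSD
ceph tell osd.6 injectargs '--debug_ms 20'

# or persistently, in ceph.conf before restarting the daemon:
# [osd]
#     debug ms = 20
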
Comment 5 Samuel Just 2016-06-15 11:04:29 EDT
This has been open for quite a while with no new information, so I'm closing it.
