Bug 1265973

Summary: After upgrading from 1.1 to 1.3 via 1.2.3, the OSD process crashes.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RADOS
Assignee: Samuel Just <sjust>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 1.3.0
CC: ceph-eng-bugs, dzafman, flucifre, kchai, kdreyer, sjust, vakulkar
Target Milestone: rc
Target Release: 1.3.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-0.94.3-2.el7cp (RHEL), Ceph v0.94.3.2 (Ubuntu)
Doc Type: Known Issue
Doc Text:
OSD fails after upgrading from Ceph version 1.1 to 1.3. When Ceph version 1.3 creates a new Object Storage Device (OSD) on a cluster whose monitors still have maps created with Ceph version 1.1, or when the new OSD has not communicated with a monitor since version 1.1, the OSD process terminates unexpectedly.
Last Closed: 2015-11-23 20:22:44 UTC
Type: Bug
Bug Blocks: 1230323
Attachments:
  10.12.27.11 monstore (flags: none)
  Reproducer yaml (flags: none)

Comment 6 Kefu Chai 2015-09-24 10:57:01 UTC
==== on mon (mon.hp-ms-01-c10, 10.12.27.10:6789) side ====

2015-09-23 12:45:53.831426 7ffca1f797c0  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 2905
2015-09-23 12:45:54.111469 7ffca1f797c0  0 starting mon.hp-ms-01-c10 rank 0 at 10.12.27.10:6789/0 mon_data /var/lib/ceph/mon/ceph-hp-ms-01-c10 fsid 59ee50d3-6435-4bdd-9ddc-e15a2556b592
...
2015-09-24 03:26:44.830124 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 7 ==== mon_get_version(what=osdmap handle=1) v1 ==== 18+0+0 (4194021778 0 0) 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.830263 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_check_map_ack(handle=1 version=102) v2 -- ?+0 0x552bde0 con 0x61c98c0
2015-09-24 03:26:44.831461 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 8 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 ==== 69+0+0 (3916052530 0 0) 0x5b6e800 con 0x61c98c0
2015-09-24 03:26:44.841197 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(1..101 src has 1..102) v3 -- ?+0 0x683e540 con 0x61c98c0
2015-09-24 03:26:44.841303 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.844545 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 9 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 ==== 69+0+0 (492105338 0 0) 0x5b6ce00 con 0x61c98c0
2015-09-24 03:26:44.844991 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(102..102 src has 1..102) v3 -- ?+0 0x5870000 con 0x61c98c0
2015-09-24 03:26:44.845054 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x552d280 con 0x61c98c0


==== on osd.4 (10.12.27.13:6800/5942) side ====

    -8> 2015-09-24 03:27:21.106064 7f7456fa1700 10 monclient: renew_subs
    -7> 2015-09-24 03:27:21.106106 7f7456fa1700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -6> 2015-09-24 03:27:21.106143 7f7456fa1700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 -- ?+0 0x3fc4a00 con 0x3fd6460
    -5> 2015-09-24 03:27:21.116237 7f745f7b2700  1 -- 10.12.27.13:6800/5942 <== mon.0 10.12.27.10:6789/0 11 ==== osd_map(1..101 src has 1..102) v3 ==== 37644+0+0 (1705508942 0 0) 0x3fa4140 con 0x3fd6460
    -4> 2015-09-24 03:27:21.116323 7f745f7b2700 10 monclient: renew_subs
    -3> 2015-09-24 03:27:21.116344 7f745f7b2700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -2> 2015-09-24 03:27:21.116384 7f745f7b2700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 -- ?+0 0x3fc2400 con 0x3fd6460
    -1> 2015-09-24 03:27:21.116456 7f745f7b2700  3 osd.4 0 handle_osd_map epochs [1,101], i have 0, src has [1,102]
     0> 2015-09-24 03:27:21.123355 7f745f7b2700 -1 *** Caught signal (Aborted) **
 in thread 7f745f7b2700


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0x9f63f2]
 2: (()+0xf130) [0x7f746ec9c130]
 3: (gsignal()+0x37) [0x7f746d6b65d7]
 4: (abort()+0x148) [0x7f746d6b7cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f746dfba9b5]
 6: (()+0x5e926) [0x7f746dfb8926]
 7: (()+0x5e953) [0x7f746dfb8953]
 8: (()+0x5eb73) [0x7f746dfb8b73]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
 10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
 11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x8c) [0xab213c]
 12: (OSDMap::decode(ceph::buffer::list&)+0x4f) [0xab425f]
 13: (OSD::handle_osd_map(MOSDMap*)+0xd3d) [0x6700ad]
 14: (OSD::_dispatch(Message*)+0x41b) [0x6732eb]
 15: (OSD::ms_dispatch(Message*)+0x277) [0x673817]
 16: (DispatchQueue::entry()+0x64a) [0xbaf77a]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0xad4f6d]
 18: (()+0x7df5) [0x7f746ec94df5]
 19: (clone()+0x6d) [0x7f746d7771ad]


All of the osd.{2,3,4} logs show the same backtrace and log messages before the crash.
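
The abort is in OSDMap::decode_classic(), where buffer::list::iterator::copy() apparently runs off the end of the received map payload: the hammer OSD cannot decode the full maps the mon is sending. A minimal sketch for comparing the two sides, assuming the "osdmap_first_committed"/"osdmap_last_committed" fields of "ceph report" and the OSD admin-socket "status" command behave the same on these builds:

On a mon host, show the range of full osdmaps the monitors still serve:

# ceph report 2>/dev/null | grep -E '"osdmap_(first|last)_committed"'

On the OSD host, show which epochs the daemon has stored locally ("i have 0" above means this OSD has none yet):

# ceph daemon osd.4 status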

Comment 7 Kefu Chai 2015-09-24 12:41:49 UTC
ktdreyer suspects that some special config option could be triggering the problem, since we already have firefly-to-hammer upgrade test suites in ceph-qa-suite.

Comment 8 Ken Dreyer (Red Hat) 2015-09-24 13:56:29 UTC
Sam, would you mind looking into this OSD crash, or else re-assigning as appropriate?

Comment 11 Kefu Chai 2015-09-24 15:51:08 UTC
Sam,

Upgrade and restart the monitors one after another, then upgrade and restart the OSDs one after another. This is what I learned from Shilpa.
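
For the record, a rough sketch of that order, assuming the sysvinit "ceph" service these releases ship (exact package and service names may differ):

On each monitor host in turn, then wait for the mons to re-form quorum:

# yum update ceph
# service ceph restart mon

Then on each OSD host in turn, waiting for HEALTH_OK before moving to the next host:

# yum update ceph
# service ceph restart osd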

Comment 16 shilpa 2015-09-24 18:39:11 UTC
Created attachment 1076658 [details]
10.12.27.11 monstore

Comment 22 Samuel Just 2015-09-25 00:49:02 UTC
Created attachment 1076829 [details]
Reproducer yaml

Comment 26 Ken Dreyer (Red Hat) 2015-09-28 16:24:24 UTC
From Sam in #rh-ceph today:

This bug is caused by starting a new hammer OSD on a cluster where the mons still have maps created in dumpling. The same happens when starting a hammer OSD that has not spoken to a mon since dumpling.
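
The mon log above shows the crashing OSD being fed maps all the way back to epoch 1 ("src has [1,102]"), so it appears the mons had never trimmed their dumpling-era full maps. One hedged way to confirm what the mons are still serving is to pull the oldest full map and try decoding it offline with the hammer tools (the output path here is just an example):

# ceph osd getmap 1 -o /tmp/osdmap.1
# osdmaptool --print /tmp/osdmap.1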

Comment 27 Federico Lucifredi 2015-09-28 20:15:24 UTC
We are keeping this in the 1.3.1 release.

QE should re-do the test, making sure they bring the 1.1 -> 1.2 cluster up to "active+clean" before proceeding to Hammer, per Ken.
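
A simple poll QE could use to gate each step of the upgrade (the health strings may vary between these releases):

# until ceph health | grep -q HEALTH_OK; do sleep 10; done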

Added to Known Issue tracker (1262054) for the Doc team to add to release notes.

Comment 31 Federico Lucifredi 2015-09-30 00:25:52 UTC
Re-assigning to the correct known-issue tracker (1.3.0). Thanks Harish, good eyes!

Comment 37 shilpa 2015-11-06 07:04:24 UTC
Verified on ceph-0.94.3-3.el7cp.x86_64

Upgraded from 1.1 (RHEL 6.7) -> 1.2.3 -> 1.3.1 (RHEL 7.1).

No crashes found. I/O is running fine.

# ceph health
HEALTH_OK

Comment 39 errata-xmlrpc 2015-11-23 20:22:44 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2512

Comment 40 Siddharth Sharma 2015-11-23 21:53:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2066