Bug 1265973 - After an upgrade from 1.1 to 1.3 through 1.2.3, OSD process is crashing.
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 1.3.1
Assigned To: Samuel Just
QA Contact: ceph-qe-bugs
Depends On:
Blocks: 1230323
Reported: 2015-09-24 04:47 EDT by shilpa
Modified: 2017-07-30 11:13 EDT
CC: 7 users

See Also:
Fixed In Version: ceph-0.94.3-2.el7cp (RHEL), Ceph v0.94.3.2 (Ubuntu)
Doc Type: Known Issue
Doc Text:
OSD fails after upgrading from Ceph version 1.1 to 1.3. When Ceph version 1.3 creates a new Object Storage Device (OSD) on a Ceph cluster where the monitors still have maps created with Ceph version 1.1, or when the new OSD has not communicated with a monitor since it was running version 1.1, the OSD process terminates unexpectedly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-23 15:22:44 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
10.12.27.11 monstore (1.22 MB, application/x-tar)
2015-09-24 14:39 EDT, shilpa
Reproducer yaml (2.87 KB, text/plain)
2015-09-24 20:49 EDT, Samuel Just


External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 13234 None None None Never

Comment 6 Kefu Chai 2015-09-24 06:57:01 EDT
=== on mon (mon.hp-ms-01-c10, 10.12.27.10:6789) side  ====

2015-09-23 12:45:53.831426 7ffca1f797c0  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 2905
2015-09-23 12:45:54.111469 7ffca1f797c0  0 starting mon.hp-ms-01-c10 rank 0 at 10.12.27.10:6789/0 mon_data /var/lib/ceph/mon/ceph-hp-ms-01-c10 fsid 59ee50d3-6435-4bdd-9ddc-e15a2556b592
...
2015-09-24 03:26:44.830124 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 7 ==== mon_get_version(what=osdmap handle=1) v1 ==== 18+0+0 (4194021778 0 0) 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.830263 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_check_map_ack(handle=1 version=102) v2 -- ?+0 0x552bde0 con 0x61c98c0
2015-09-24 03:26:44.831461 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 8 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 ==== 69+0+0 (3916052530 0 0) 0x5b6e800 con 0x61c98c0
2015-09-24 03:26:44.841197 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(1..101 src has 1..102) v3 -- ?+0 0x683e540 con 0x61c98c0
2015-09-24 03:26:44.841303 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.844545 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 9 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 ==== 69+0+0 (492105338 0 0) 0x5b6ce00 con 0x61c98c0
2015-09-24 03:26:44.844991 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(102..102 src has 1..102) v3 -- ?+0 0x5870000 con 0x61c98c0
2015-09-24 03:26:44.845054 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x552d280 con 0x61c98c0


==== on osd.4 (10.12.27.13:6800/5942) side =======

    -8> 2015-09-24 03:27:21.106064 7f7456fa1700 10 monclient: renew_subs
    -7> 2015-09-24 03:27:21.106106 7f7456fa1700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -6> 2015-09-24 03:27:21.106143 7f7456fa1700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 -- ?+0 0x3fc4a00 con 0x3fd6460
    -5> 2015-09-24 03:27:21.116237 7f745f7b2700  1 -- 10.12.27.13:6800/5942 <== mon.0 10.12.27.10:6789/0 11 ==== osd_map(1..101 src has 1..102) v3 ==== 37644+0+0 (1705508942 0 0) 0x3fa4140 con 0x3fd6460
    -4> 2015-09-24 03:27:21.116323 7f745f7b2700 10 monclient: renew_subs
    -3> 2015-09-24 03:27:21.116344 7f745f7b2700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -2> 2015-09-24 03:27:21.116384 7f745f7b2700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 -- ?+0 0x3fc2400 con 0x3fd6460
    -1> 2015-09-24 03:27:21.116456 7f745f7b2700  3 osd.4 0 handle_osd_map epochs [1,101], i have 0, src has [1,102]
     0> 2015-09-24 03:27:21.123355 7f745f7b2700 -1 *** Caught signal (Aborted) **
 in thread 7f745f7b2700


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0x9f63f2]
 2: (()+0xf130) [0x7f746ec9c130]
 3: (gsignal()+0x37) [0x7f746d6b65d7]
 4: (abort()+0x148) [0x7f746d6b7cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f746dfba9b5]
 6: (()+0x5e926) [0x7f746dfb8926]
 7: (()+0x5e953) [0x7f746dfb8953]
 8: (()+0x5eb73) [0x7f746dfb8b73]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
 10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
 11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x8c) [0xab213c]
 12: (OSDMap::decode(ceph::buffer::list&)+0x4f) [0xab425f]
 13: (OSD::handle_osd_map(MOSDMap*)+0xd3d) [0x6700ad]
 14: (OSD::_dispatch(Message*)+0x41b) [0x6732eb]
 15: (OSD::ms_dispatch(Message*)+0x277) [0x673817]
 16: (DispatchQueue::entry()+0x64a) [0xbaf77a]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0xad4f6d]
 18: (()+0x7df5) [0x7f746ec94df5]
 19: (clone()+0x6d) [0x7f746d7771ad]


All of the osd.{2,3,4}.log files have the same backtrace and log messages before they crashed.
Comment 7 Kefu Chai 2015-09-24 08:41:49 EDT
ktdreyer suspects that some special config option could be triggering the problem, because we already have firefly-to-hammer upgrade test suites in ceph-qa-suites.
Comment 8 Ken Dreyer (Red Hat) 2015-09-24 09:56:29 EDT
Sam, would you mind looking into this OSD crash, or else re-assigning as appropriate?
Comment 11 Kefu Chai 2015-09-24 11:51:08 EDT
Sam,

The procedure, as I learnt it from Shilpa: upgrade and restart the monitors one after another, then upgrade and restart the OSDs one after another.
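A rough outline of that rolling procedure is sketched below; the host names, update command, and init invocations are placeholders and will differ per environment (this is not taken from the actual test run):

    # Upgrade and restart the monitors one at a time, letting each rejoin quorum.
    for host in mon-host-1 mon-host-2 mon-host-3; do
        ssh "$host" 'yum update -y ceph && /etc/init.d/ceph restart mon'
        ceph quorum_status    # confirm the restarted mon is back in quorum
    done

    # Then upgrade and restart the OSD hosts one at a time.
    for host in osd-host-1 osd-host-2 osd-host-3; do
        ssh "$host" 'yum update -y ceph && /etc/init.d/ceph restart osd'
        ceph osd stat         # confirm all OSDs report up/in again
    done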
Comment 16 shilpa 2015-09-24 14:39 EDT
Created attachment 1076658 [details]
10.12.27.11 monstore
Comment 22 Samuel Just 2015-09-24 20:49 EDT
Created attachment 1076829 [details]
Reproducer yaml
Comment 26 Ken Dreyer (Red Hat) 2015-09-28 12:24:24 EDT
From Sam in #rh-ceph today:

This bug is caused by starting a new hammer OSD on a cluster where the mons still have maps created in dumpling. Alternatively, start a hammer OSD which has not spoken to a mon since dumpling.
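One way to reduce that exposure, sketched here only as an illustration (output details vary by release), is to confirm before creating any new OSD that every existing OSD has been restarted on hammer and is reachable through the monitors since the upgrade:

    # Check that each existing OSD reports the post-upgrade (0.94.x hammer) version.
    for id in $(ceph osd ls); do
        ceph tell osd."$id" version
    done

    # Confirm all OSDs are up/in and the cluster is healthy before adding new OSDs.
    ceph osd stat
    ceph health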
Comment 27 Federico Lucifredi 2015-09-28 16:15:24 EDT
We are keeping this in the 1.3.1 release.

QE should re-do the test, making sure that they bring their 1.1 -> 1.2 cluster up to "active+clean" before proceeding to Hammer, per Ken.
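An illustrative way to gate on that, assuming the standard CLI tools are available on the admin node (a sketch, not part of the original test plan):

    # Do not proceed to hammer (1.3) until the 1.2 cluster reports HEALTH_OK
    # and all placement groups are active+clean.
    until ceph health | grep -q HEALTH_OK; do
        sleep 30
    done
    ceph pg stat    # should show all PGs as active+clean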

Added to Known Issue tracker (1262054) for the Doc team to add to release notes.
Comment 31 Federico Lucifredi 2015-09-29 20:25:52 EDT
Re-assigning to the correct known-issue tracker (1.3.0). Thanks Harish, good eyes!
Comment 37 shilpa 2015-11-06 02:04:24 EST
Verified on ceph-0.94.3-3.el7cp.x86_64

Upgraded from 1.1 (RHEL 6.7) -> 1.2.3 -> 1.3.1 (RHEL 7.1)

No crashes found. I/O is running fine.

# ceph health
HEALTH_OK
Comment 39 errata-xmlrpc 2015-11-23 15:22:44 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2512
Comment 40 Siddharth Sharma 2015-11-23 16:53:36 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2066
