=== on mon (mon.hp-ms-01-c10, 10.12.27.10:6789) side ====

2015-09-23 12:45:53.831426 7ffca1f797c0  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 2905
2015-09-23 12:45:54.111469 7ffca1f797c0  0 starting mon.hp-ms-01-c10 rank 0 at 10.12.27.10:6789/0 mon_data /var/lib/ceph/mon/ceph-hp-ms-01-c10 fsid 59ee50d3-6435-4bdd-9ddc-e15a2556b592
...
2015-09-24 03:26:44.830124 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 7 ==== mon_get_version(what=osdmap handle=1) v1 ==== 18+0+0 (4194021778 0 0) 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.830263 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_check_map_ack(handle=1 version=102) v2 -- ?+0 0x552bde0 con 0x61c98c0
2015-09-24 03:26:44.831461 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 8 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 ==== 69+0+0 (3916052530 0 0) 0x5b6e800 con 0x61c98c0
2015-09-24 03:26:44.841197 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(1..101 src has 1..102) v3 -- ?+0 0x683e540 con 0x61c98c0
2015-09-24 03:26:44.841303 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.844545 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 9 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 ==== 69+0+0 (492105338 0 0) 0x5b6ce00 con 0x61c98c0
2015-09-24 03:26:44.844991 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(102..102 src has 1..102) v3 -- ?+0 0x5870000 con 0x61c98c0
2015-09-24 03:26:44.845054 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x552d280 con 0x61c98c0

==== on osd.4 (10.12.27.13:6800/5942) side =======

 -8> 2015-09-24 03:27:21.106064 7f7456fa1700 10 monclient: renew_subs
 -7> 2015-09-24 03:27:21.106106 7f7456fa1700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
 -6> 2015-09-24 03:27:21.106143 7f7456fa1700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 -- ?+0 0x3fc4a00 con 0x3fd6460
 -5> 2015-09-24 03:27:21.116237 7f745f7b2700  1 -- 10.12.27.13:6800/5942 <== mon.0 10.12.27.10:6789/0 11 ==== osd_map(1..101 src has 1..102) v3 ==== 37644+0+0 (1705508942 0 0) 0x3fa4140 con 0x3fd6460
 -4> 2015-09-24 03:27:21.116323 7f745f7b2700 10 monclient: renew_subs
 -3> 2015-09-24 03:27:21.116344 7f745f7b2700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
 -2> 2015-09-24 03:27:21.116384 7f745f7b2700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 -- ?+0 0x3fc2400 con 0x3fd6460
 -1> 2015-09-24 03:27:21.116456 7f745f7b2700  3 osd.4 0 handle_osd_map epochs [1,101], i have 0, src has [1,102]
  0> 2015-09-24 03:27:21.123355 7f745f7b2700 -1 *** Caught signal (Aborted) **
 in thread 7f745f7b2700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0x9f63f2]
 2: (()+0xf130) [0x7f746ec9c130]
 3: (gsignal()+0x37) [0x7f746d6b65d7]
 4: (abort()+0x148) [0x7f746d6b7cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f746dfba9b5]
 6: (()+0x5e926) [0x7f746dfb8926]
 7: (()+0x5e953) [0x7f746dfb8953]
 8: (()+0x5eb73) [0x7f746dfb8b73]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
 10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
 11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x8c) [0xab213c]
 12: (OSDMap::decode(ceph::buffer::list&)+0x4f) [0xab425f]
 13: (OSD::handle_osd_map(MOSDMap*)+0xd3d) [0x6700ad]
 14: (OSD::_dispatch(Message*)+0x41b) [0x6732eb]
 15: (OSD::ms_dispatch(Message*)+0x277) [0x673817]
 16: (DispatchQueue::entry()+0x64a) [0xbaf77a]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0xad4f6d]
 18: (()+0x7df5) [0x7f746ec94df5]
 19: (clone()+0x6d) [0x7f746d7771ad]

All osd.{2,3,4}.log have the same backtrace and log messages before they crashed.
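Frames 9-12 of the backtrace put the abort inside OSDMap::decode_classic() while copying out of the message buffer, i.e. osd.4 fails to decode the osd_map(1..101) payload the mon sends it. A minimal sketch for capturing the failure with more context, assuming sysvinit-style service management on these nodes and that running with elevated debug levels is acceptable (the levels themselves are illustrative):

  service ceph stop osd.4
  # rerun the daemon in the foreground with verbose map handling and messenger logging
  ceph-osd -i 4 -f --debug-osd 20 --debug-ms 1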
ktdreyer suspects that some special config option could be triggering the problem, since we already have firefly-to-hammer upgrade test suites in ceph-qa-suite.
Sam, would you mind looking into this OSD crash, or re-assigning it as appropriate?
Sam, the procedure is: upgrade and restart the monitors one after another, then upgrade and restart the OSDs one after another. This is what I learned from Shilpa. A sketch of the sequence follows.
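A minimal sketch of that rolling sequence, assuming sysvinit-style service management on these hammer nodes and that a plain `yum update ceph` pulls the new packages (the mon and OSD ids are taken from the logs above):

  # on each monitor host, one at a time
  yum update ceph
  service ceph restart mon.hp-ms-01-c10
  ceph -s    # wait for the mon to rejoin quorum before the next host

  # then on each OSD host, one at a time
  yum update ceph
  service ceph restart osd.4
  ceph -s    # wait for HEALTH_OK before moving to the next OSD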
Created attachment 1076658: 10.12.27.11 monstore
Created attachment 1076829: Reproducer yaml
From Sam in #rh-ceph today: this bug is caused by starting a new hammer OSD on a cluster whose mons still hold maps created in dumpling, or, alternatively, by starting a hammer OSD that has not spoken to a mon since dumpling.
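One hedged way to confirm the mons are still serving dumpling-vintage maps is to fetch the oldest full map they advertise ("src has 1..102" in the logs above) and decode it offline with osdmaptool, which ships with ceph; the output path is illustrative:

  ceph osd getmap 1 -o /tmp/osdmap.1    # pull the epoch-1 full map from the mons
  osdmaptool --print /tmp/osdmap.1      # the created/modified stamps date the map; a decode error here mirrors the OSD-side crash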
We are keeping this in the 1.3.1 release. QE should re-do the test, making sure they bring their 1.1 -> 1.2 cluster up to "active+clean" before proceeding to hammer, per Ken. Added to the known-issue tracker (1262054) for the Doc team to add to the release notes.
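A minimal sketch of that gate between upgrade steps, assuming the usual `ceph health` output strings for this release:

  # after the 1.1 -> 1.2 step, block until the cluster settles
  while ! ceph health | grep -q HEALTH_OK; do sleep 10; done
  ceph pg stat    # should report all PGs active+clean
  # only then start the hammer (1.3.x) upgrade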
Re-assigning to the correct known-issue tracker (1.3.0). Thanks Harish, good eyes!
Verified on ceph-0.94.3-3.el7cp.x86_64. Upgraded from 1.1 (RHEL 6.7) -> 1.2.3 -> 1.3.1 (RHEL 7.1). No crashes found, and I/O is running fine.

  # ceph health
  HEALTH_OK
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2512
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2066