Bug 1265973 - After an upgrade from 1.1 to 1.3 through 1.2.3, OSD process is crashing.
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 1.3.1
Assigned To: Samuel Just
QA Contact: ceph-qe-bugs
Depends On:
Blocks: 1230323
Reported: 2015-09-24 04:47 EDT by shilpa
Modified: 2017-07-30 11:13 EDT
CC: 7 users

See Also:
Fixed In Version: ceph-0.94.3-2.el7cp (RHEL), Ceph v0.94.3.2 (Ubuntu)
Doc Type: Known Issue
Doc Text:
OSD fails after upgrading from Ceph version 1.1 to 1.3. When Ceph version 1.3 creates a new Object Storage Device (OSD) on a Ceph cluster where the monitors still have maps created with Ceph version 1.1, or when the new OSD has not communicated with a monitor since it was running version 1.1, the OSD process terminates unexpectedly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-23 15:22:44 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
10.12.27.11 monstore (1.22 MB, application/x-tar)
2015-09-24 14:39 EDT, shilpa
Reproducer yaml (2.87 KB, text/plain)
2015-09-24 20:49 EDT, Samuel Just


External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 13234 None None None Never

Comment 6 Kefu Chai 2015-09-24 06:57:01 EDT
=== on mon (mon.hp-ms-01-c10, 10.12.27.10:6789) side  ====

2015-09-23 12:45:53.831426 7ffca1f797c0  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-mon, pid 2905
2015-09-23 12:45:54.111469 7ffca1f797c0  0 starting mon.hp-ms-01-c10 rank 0 at 10.12.27.10:6789/0 mon_data /var/lib/ceph/mon/ceph-hp-ms-01-c10 fsid 59ee50d3-6435-4bdd-9ddc-e15a2556b592
...
2015-09-24 03:26:44.830124 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 7 ==== mon_get_version(what=osdmap handle=1) v1 ==== 18+0+0 (4194021778 0 0) 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.830263 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_check_map_ack(handle=1 version=102) v2 -- ?+0 0x552bde0 con 0x61c98c0
2015-09-24 03:26:44.831461 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 8 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 ==== 69+0+0 (3916052530 0 0) 0x5b6e800 con 0x61c98c0
2015-09-24 03:26:44.841197 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(1..101 src has 1..102) v3 -- ?+0 0x683e540 con 0x61c98c0
2015-09-24 03:26:44.841303 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x5528b40 con 0x61c98c0
2015-09-24 03:26:44.844545 7ffc9a606700  1 -- 10.12.27.10:6789/0 <== osd.4 10.12.27.13:6800/5284 9 ==== mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 ==== 69+0+0 (492105338 0 0) 0x5b6ce00 con 0x61c98c0
2015-09-24 03:26:44.844991 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- osd_map(102..102 src has 1..102) v3 -- ?+0 0x5870000 con 0x61c98c0
2015-09-24 03:26:44.845054 7ffc9a606700  1 -- 10.12.27.10:6789/0 --> 10.12.27.13:6800/5284 -- mon_subscribe_ack(300s) v1 -- ?+0 0x552d280 con 0x61c98c0


==== on osd.4 (10.12.27.13:6800/5942) side =======

    -8> 2015-09-24 03:27:21.106064 7f7456fa1700 10 monclient: renew_subs
    -7> 2015-09-24 03:27:21.106106 7f7456fa1700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -6> 2015-09-24 03:27:21.106143 7f7456fa1700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=1}) v2 -- ?+0 0x3fc4a00 con 0x3fd6460
    -5> 2015-09-24 03:27:21.116237 7f745f7b2700  1 -- 10.12.27.13:6800/5942 <== mon.0 10.12.27.10:6789/0 11 ==== osd_map(1..101 src has 1..102) v3 ==== 37644+0+0 (1705508942 0 0) 0x3fa4140 con 0x3fd6460
    -4> 2015-09-24 03:27:21.116323 7f745f7b2700 10 monclient: renew_subs
    -3> 2015-09-24 03:27:21.116344 7f745f7b2700 10 monclient: _send_mon_message to mon.hp-ms-01-c10 at 10.12.27.10:6789/0
    -2> 2015-09-24 03:27:21.116384 7f745f7b2700  1 -- 10.12.27.13:6800/5942 --> 10.12.27.10:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=0}) v2 -- ?+0 0x3fc2400 con 0x3fd6460
    -1> 2015-09-24 03:27:21.116456 7f745f7b2700  3 osd.4 0 handle_osd_map epochs [1,101], i have 0, src has [1,102]
     0> 2015-09-24 03:27:21.123355 7f745f7b2700 -1 *** Caught signal (Aborted) **
 in thread 7f745f7b2700


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0x9f63f2]
 2: (()+0xf130) [0x7f746ec9c130]
 3: (gsignal()+0x37) [0x7f746d6b65d7]
 4: (abort()+0x148) [0x7f746d6b7cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f746dfba9b5]
 6: (()+0x5e926) [0x7f746dfb8926]
 7: (()+0x5e953) [0x7f746dfb8953]
 8: (()+0x5eb73) [0x7f746dfb8b73]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
 10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
 11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x8c) [0xab213c]
 12: (OSDMap::decode(ceph::buffer::list&)+0x4f) [0xab425f]
 13: (OSD::handle_osd_map(MOSDMap*)+0xd3d) [0x6700ad]
 14: (OSD::_dispatch(Message*)+0x41b) [0x6732eb]
 15: (OSD::ms_dispatch(Message*)+0x277) [0x673817]
 16: (DispatchQueue::entry()+0x64a) [0xbaf77a]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0xad4f6d]
 18: (()+0x7df5) [0x7f746ec94df5]
 19: (clone()+0x6d) [0x7f746d7771ad]


All of the osd.{2,3,4}.log files have the same backtrace and log messages before they crashed.
Comment 7 Kefu Chai 2015-09-24 08:41:49 EDT
ktdreyer suspects that some special config option could be triggering the problem, because we already have firefly-to-hammer upgrade test suites in ceph-qa-suites.
Comment 8 Ken Dreyer (Red Hat) 2015-09-24 09:56:29 EDT
Sam, would you mind looking into this OSD crash, or else re-assigning as appropriate?
Comment 11 Kefu Chai 2015-09-24 11:51:08 EDT
Sam,

The procedure, as I learnt it from Shilpa: upgrade and restart the monitors one after another, then upgrade and restart the OSDs one after another.
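A rough outline of that rolling procedure is sketched below; the host names, update command, and init invocations are placeholders and will differ per environment (this is not taken from the actual test run):

    # Upgrade and restart the monitors one at a time, letting each rejoin quorum.
    for host in mon-host-1 mon-host-2 mon-host-3; do
        ssh "$host" 'yum update -y ceph && /etc/init.d/ceph restart mon'
        ceph quorum_status    # confirm the restarted mon is back in quorum
    done

    # Then upgrade and restart the OSD hosts one at a time.
    for host in osd-host-1 osd-host-2 osd-host-3; do
        ssh "$host" 'yum update -y ceph && /etc/init.d/ceph restart osd'
        ceph osd stat         # confirm all OSDs report up/in again
    done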
Comment 16 shilpa 2015-09-24 14:39 EDT
Created attachment 1076658 [details]
10.12.27.11 monstore
Comment 22 Samuel Just 2015-09-24 20:49 EDT
Created attachment 1076829 [details]
Reproducer yaml
Comment 26 Ken Dreyer (Red Hat) 2015-09-28 12:24:24 EDT
From Sam in #rh-ceph today:

This bug is caused by starting a new hammer OSD on a cluster where the mons still have maps created in dumpling. Alternatively, start a hammer OSD which has not spoken to a mon since dumpling.
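One way to reduce that exposure, sketched here only as an illustration (output details vary by release), is to confirm before creating any new OSD that every existing OSD has been restarted on hammer and is reachable through the monitors since the upgrade:

    # Check that each existing OSD reports the post-upgrade (0.94.x hammer) version.
    for id in $(ceph osd ls); do
        ceph tell osd."$id" version
    done

    # Confirm all OSDs are up/in and the cluster is healthy before adding new OSDs.
    ceph osd stat
    ceph health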
Comment 27 Federico Lucifredi 2015-09-28 16:15:24 EDT
We are keeping this in the 1.3.1 release.

QE should re-do the test, making sure that they bring their 1.1 -> 1.2 cluster up to "active+clean" before proceeding to Hammer, per Ken.
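An illustrative way to gate on that, assuming the standard CLI tools are available on the admin node (a sketch, not part of the original test plan):

    # Do not proceed to hammer (1.3) until the 1.2 cluster reports HEALTH_OK
    # and all placement groups are active+clean.
    until ceph health | grep -q HEALTH_OK; do
        sleep 30
    done
    ceph pg stat    # should show all PGs as active+clean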

Added to Known Issue tracker (1262054) for the Doc team to add to release notes.
Comment 31 Federico Lucifredi 2015-09-29 20:25:52 EDT
Re-assigning to the correct known-issue tracker (1.3.0). Thanks Harish, good eyes!
Comment 37 shilpa 2015-11-06 02:04:24 EST
Verified on ceph-0.94.3-3.el7cp.x86_64

Upgraded from 1.1 (RHEL 6.7) -> 1.2.3 -> 1.3.1 (RHEL 7.1)

No crashes found. I/O is running fine.

# ceph health
HEALTH_OK
Comment 39 errata-xmlrpc 2015-11-23 15:22:44 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2512
Comment 40 Siddharth Sharma 2015-11-23 16:53:36 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2066
