Bug 1402185 - jewel: osd: condition OSDMap encoding on features
Summary: jewel: osd: condition OSDMap encoding on features
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.1
Assignee: Samuel Just
QA Contact: shylesh
Duplicates: 1379027
Reported: 2016-12-07 00:50 UTC by Ken Dreyer (Red Hat)
Modified: 2017-07-30 15:20 UTC (History)

Fixed In Version: RHEL: ceph-10.2.3-16.el7cp Ubuntu: ceph_10.2.3-17redhat1xenial
Doc Type: Bug Fix
Doc Text:
Due to changes in encoding of the OSD map in the ceph package version 10.2.2, upgrading from Red Hat Ceph Storage 1.3 to 2.0 sometimes led to serious performance issues on large clusters that contain hundreds of OSDs. With this update, the underlying source code has been improved, and upgrading from 1.3 to 2.0 works as expected.
Last Closed: 2016-12-15 16:49:21 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2954 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage 2.1 security and bug fix update 2017-03-22 02:06:31 UTC
Ceph Project Bug Tracker 18015 None None None 2016-12-07 00:51:15 UTC
Red Hat Product Errata RHSA-2016:2956 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage 2.1 security and bug fix update 2016-12-15 23:02:58 UTC

Comment 4 Sage Weil 2016-12-07 02:00:04 UTC
The problem:

When upgrading from hammer to jewel, the OSDMap encoding format changed. As a result, the OSDs hit a 'failed to encode map with expected crc' (or similar) error, which makes them request a full map from the mon.  This generally works, but can DoS the mon in a large cluster.

The workaround for hammer->jewel is to upgrade the OSDs before the mons.  This patch is still required, though, to make the new jewel OSDs encode the map in a way that matches hammer and so avoid the problem.
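
A toy sketch of the mechanism described above, not Ceph's actual code: the feature bit, the map layout, and the function names are all hypothetical, and zlib.crc32 stands in for whatever checksum the real encoder uses. It only illustrates why re-encoding a map in a newer format fails the CRC check, while encoding conditioned on the peer's features passes it.

```python
import zlib

# Hypothetical feature bit for illustration; real Ceph feature flags differ.
FEATURE_NEW_ENCODING = 1 << 0


def encode_map(epoch, osds, features):
    """Encode a toy map; the byte layout depends on the given feature bits."""
    osd_list = ",".join(str(osd) for osd in osds)
    if features & FEATURE_NEW_ENCODING:
        # Newer layout carries an extra field, so the bytes differ.
        body = f"v2|epoch={epoch}|osds={osd_list}|extra=0"
    else:
        body = f"v1|epoch={epoch}|osds={osd_list}"
    return body.encode("ascii")


def crc(blob):
    """Checksum of the encoded map (crc32 is a stand-in)."""
    return zlib.crc32(blob) & 0xFFFFFFFF


def osd_accepts_map(epoch, osds, expected_crc, encode_features):
    """True if the locally re-encoded map matches the CRC the mon expects.

    A False result models the 'failed to encode map with expected crc'
    case, after which the OSD would have to request a full map.
    """
    return crc(encode_map(epoch, osds, encode_features)) == expected_crc


# An old mon computes the CRC over the old layout (no feature bits).
mon_crc = crc(encode_map(5, [0, 1, 2], features=0))

# New OSD that always uses the new layout: mismatch.
unconditional_ok = osd_accepts_map(5, [0, 1, 2], mon_crc, FEATURE_NEW_ENCODING)

# New OSD that conditions its encoding on the old feature set: match.
conditioned_ok = osd_accepts_map(5, [0, 1, 2], mon_crc, 0)
```

In this model, each mismatch stands for one full-map request to the mon, which is why the cost grows with the number of OSDs.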


Customer impact:

Large clusters can DoS the mon during a 1.3 -> 2.y upgrade, leading to loss of availability during the upgrade.


How widespread:

For small clusters it's not a problem. For large clusters it is.  I would recommend this fix for any cluster with more than 250 OSDs, and *strongly* recommend delaying any upgrade to 2.y for any cluster with more than 500 OSDs until this fix is available.  (These are pretty arbitrary numbers.)

Comment 6 Ken Dreyer (Red Hat) 2016-12-07 03:29:09 UTC
*** Bug 1379027 has been marked as a duplicate of this bug. ***

Comment 17 shylesh 2016-12-15 07:59:12 UTC
Upgraded an 84-OSD cluster; grep "failed to encode map with expected crc" ./* found no matches in any of the ceph.log files.

Hence marking this as verified.

Verified on 10.2.3-17.el7cp.x86_64
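
For repeating this check on other clusters, the grep above could be scripted roughly as follows. This is a minimal sketch: the function name and the assumption that all logs sit in a single directory are illustrative and not part of the original verification.

```python
from pathlib import Path

# Error text taken from the grep in this comment.
ERROR_TEXT = "failed to encode map with expected crc"


def files_with_crc_errors(log_dir, pattern="*"):
    """Return the names of files in log_dir that contain the CRC error text."""
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue  # skip subdirectories, like a non-recursive grep
        # errors="replace" keeps the scan going past undecodable bytes.
        with path.open(errors="replace") as handle:
            if any(ERROR_TEXT in line for line in handle):
                hits.append(path.name)
    return hits
```

An empty list corresponds to the "not found" result reported above.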

Comment 19 errata-xmlrpc 2016-12-15 16:49:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2954.html

