Bug 1402185

Summary: jewel: osd: condition OSDMap encoding on features
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Ken Dreyer (Red Hat) <kdreyer>
Component: RADOS Assignee: Samuel Just <sjust>
Status: CLOSED ERRATA QA Contact: shylesh <shmohan>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 2.1 CC: ceph-eng-bugs, dzafman, kchai, kdreyer, kurs, sjust, sweil, vumrao
Target Milestone: rc   
Target Release: 2.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.3-16.el7cp Ubuntu: ceph_10.2.3-17redhat1xenial Doc Type: Bug Fix
Doc Text:
Due to changes in encoding of the OSD map in the ceph package version 10.2.2, upgrading from Red Hat Ceph Storage 1.3 to 2.0 sometimes led to serious performance issues on large clusters that contain hundreds of OSDs. With this update, the underlying source code has been improved, and upgrading from 1.3 to 2.0 works as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-15 16:49:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 4 Sage Weil 2016-12-07 02:00:04 UTC
The problem:

When upgrading from hammer to jewel, the OSDMap encoding format changed, leading the OSDs to get a 'failed to encode map with expected crc' (or similar) error, which leads them to request a full map from the mon.  This generally works, but can DoS the mon in a large cluster.

The workaround for hammer->jewel is to upgrade the OSDs before the mons.  This patch is required, though, to make the new jewel OSDs encode the map in a way that matches hammer, which avoids the problem.
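
To illustrate the mechanism described above, here is a minimal toy sketch in Python. It is not Ceph code: the feature bit, the byte layout, and the function names are all hypothetical, and zlib.crc32 stands in for the real checksum. It only shows why an encoder that ignores the negotiated features produces a CRC mismatch (and therefore a full-map request), while an encoder conditioned on features does not.

```python
import zlib

# Hypothetical feature bit for illustration; not a real Ceph constant.
FEATURE_NEW_ENCODING = 1 << 0


def encode_map(epoch, osds, features):
    """Toy encoder: the byte layout depends on the feature bits passed in."""
    version = "v2" if features & FEATURE_NEW_ENCODING else "v1"
    body = "{}|{}|{}".format(version, epoch, ",".join(sorted(osds)))
    return body.encode("ascii")


def crc(data):
    # zlib.crc32 is an assumption standing in for the real checksum.
    return zlib.crc32(data) & 0xFFFFFFFF


def needs_full_map(expected_crc, epoch, osds, encode_features):
    """True if the locally re-encoded map does not match the expected CRC."""
    return crc(encode_map(epoch, osds, encode_features)) != expected_crc


# The old-release side computes the CRC over the old encoding (features = 0).
old_features = 0
expected = crc(encode_map(5, ["osd.0"], old_features))

# Encoding with the new layout regardless of features: mismatch.
unconditional = needs_full_map(expected, 5, ["osd.0"], FEATURE_NEW_ENCODING)

# Encoding conditioned on the features in use: match.
conditioned = needs_full_map(expected, 5, ["osd.0"], old_features)
```

In this toy model, every mismatch corresponds to one full-map request to the mon, which is the load that becomes a problem as the OSD count grows.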


Customer impact:

Large clusters can DOS the mon during 1.3 -> 2.y upgrade, leading to loss of availability during the upgrade.


How widespread:

For small clusters it's not a problem. For large clusters it is.  I would recommend this fix for any cluster >250 OSDs, and *strongly* recommend delaying any upgrade to 2.y for any cluster >500 OSDs until this fix is available.  (These are pretty arbitrary numbers.)

Comment 6 Ken Dreyer (Red Hat) 2016-12-07 03:29:09 UTC
*** Bug 1379027 has been marked as a duplicate of this bug. ***

Comment 17 shylesh 2016-12-15 07:59:12 UTC
Upgraded an 84-OSD cluster, and grep "failed to encode map with expected crc" ./* found no matches in any of the ceph.log files.

Hence marking this as verified.

Verified on 10.2.3-17.el7cp.x86_64
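
The same check can be scripted when there are many log files to scan. The following is a minimal sketch in Python, assuming the logs are plain-text files collected in a single directory; the function name and directory layout are hypothetical, and only the error string comes from this bug.

```python
import os

# Error string quoted in this bug report.
ERROR_TEXT = "failed to encode map with expected crc"


def logs_with_crc_errors(log_dir, needle=ERROR_TEXT):
    """Return the sorted names of regular files in log_dir containing needle."""
    hits = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        # errors="replace" keeps the scan going if a log has stray bytes.
        with open(path, "r", errors="replace") as f:
            if any(needle in line for line in f):
                hits.append(name)
    return hits
```

An empty list from logs_with_crc_errors(".") corresponds to the grep above finding nothing.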

Comment 19 errata-xmlrpc 2016-12-15 16:49:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2954.html