The problem: When upgrading from hammer to jewel, the OSDMap encoding format changed, which causes the OSDs to hit a 'failed to encode map with expected crc' (or similar) error and then request a full map from the mon. This generally works, but can DoS the mon in a large cluster. The workaround for hammer->jewel is to upgrade the OSDs before the mons. This patch is still required, though, to make the new jewel OSDs encode the map in a way that matches hammer and so avoid the problem.

Customer impact: Large clusters can DoS the mon during a 1.3 -> 2.y upgrade, leading to loss of availability during the upgrade.

How widespread: For small clusters it's not a problem. For large clusters it is. I would recommend this fix for any cluster >250 OSDs, and *strongly* recommend delaying any upgrade to 2.y for any cluster >500 OSDs until this fix is available. (These are pretty arbitrary numbers.)
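For illustration only, the sizing guidance above could be written down as a small helper. This is a sketch, not part of the patch: the function name and the returned strings are made up here, and the 250/500 thresholds are the admittedly arbitrary numbers from this comment.

```python
def upgrade_advice(num_osds: int) -> str:
    """Map an OSD count to the recommendation given in this comment.

    Hypothetical helper: the thresholds (250, 500) are the rough numbers
    quoted above, not values enforced by Ceph itself.
    """
    if num_osds > 500:
        # Strong recommendation: hold the 2.y upgrade until the fix ships.
        return "delay upgrade until fix is available"
    if num_osds > 250:
        # The fix is recommended before upgrading.
        return "fix recommended"
    # Small clusters are not expected to hit the mon overload.
    return "no special action"


if __name__ == "__main__":
    for n in (84, 300, 800):
        print(n, "->", upgrade_advice(n))
```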
*** Bug 1379027 has been marked as a duplicate of this bug. ***
Upgraded an 84-OSD cluster; 'grep "failed to encode map with expected crc" ./*' found no matches in any of the ceph.log files. Hence marking this as verified. Verified on 10.2.3-17.el7cp.x86_64.
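The same check as the grep above can be scripted if per-file counts are wanted. This is a minimal sketch, assuming the logs are plain-text files collected in one directory; the function name and the directory argument are hypothetical, and compressed (rotated) logs would need to be decompressed first.

```python
from pathlib import Path

# Error string quoted in this bug report.
NEEDLE = "failed to encode map with expected crc"


def count_crc_errors(log_dir):
    """Return {file name: number of matching lines} for files in log_dir.

    Only files with at least one match are included, so an empty dict
    corresponds to grep finding nothing.
    """
    counts = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going over undecodable bytes.
        with path.open("r", errors="replace") as fh:
            hits = sum(1 for line in fh if NEEDLE in line)
        if hits:
            counts[path.name] = hits
    return counts


if __name__ == "__main__":
    import sys

    result = count_crc_errors(sys.argv[1] if len(sys.argv) > 1 else ".")
    if result:
        for name, hits in result.items():
            print(f"{name}: {hits}")
    else:
        print("no matches")
```

An empty result on every node's log directory matches the verification criterion used here.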
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2954.html