When upgrading from hammer to jewel, the OSDMap encoding format changes. As a result, the OSDs hit a 'failed to encode map with expected crc' (or similar) error, which makes them request a full map from the mon. This generally works, but can DoS the mon in a large cluster.
The workaround for hammer->jewel is to upgrade the OSDs before the mons. This patch is still required, though, so that the new jewel OSDs encode the map in a way that matches hammer and avoid the problem.
Large clusters can DoS the mon during a 1.3 -> 2.y upgrade, leading to loss of availability during the upgrade.
For small clusters it's not a problem. For large clusters it is. I would recommend this for any cluster >250 OSDs, and *strongly* recommend delaying any upgrade to 2.y for any cluster >500 OSDs until this fix is available. (These are pretty arbitrary numbers.)
*** Bug 1379027 has been marked as a duplicate of this bug. ***
Upgraded an 84 OSD cluster, and grep "failed to encode map with expected crc" ./* found no matches in any of the ceph.log files.
Hence marking this as verified.
Verified on 10.2.3-17.el7cp.x86_64
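For anyone repeating the verification, the grep check above can be sketched in Python. This is a minimal sketch, not part of the fix: the function name find_crc_errors and the assumption that all logs sit as plain-text files in one directory are illustrative only. The search string is the one quoted in this bug.

```python
from pathlib import Path

# Error text quoted in this bug report; the "(or similar)" variants are not covered.
PATTERN = "failed to encode map with expected crc"


def find_crc_errors(log_dir, pattern=PATTERN):
    """Return (file name, line number, line) for each log line containing pattern.

    Hypothetical helper: scans every regular file directly inside log_dir,
    roughly like running grep "<pattern>" ./* in that directory.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories, as a non-recursive grep would
        # errors="replace" keeps the scan going if a log has stray bytes
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if pattern in line:
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Calling find_crc_errors(".") from the directory holding the ceph.log files returns an empty list when the upgrade went through without the CRC mismatch. Compressed rotated logs would need to be decompressed first, since this sketch reads plain text only.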
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.