Created attachment 1203864 [details]
Network spike on dnvrco01-cephmon-001 during upgrade

Description of problem:

When upgrading a Ceph cluster from 0.94.6 to 0.94.9, a serious performance issue is seen every time an OSD is restarted in large clusters. The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which in turn causes thousands of slow requests.

Initially the monitor logs were flooded with the following messages:

2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc

After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated when OSDs are restarted.

The log flooding is discussed in the following community thread:
http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released

It appears that the osdmap encoding changed starting with 0.94.7 (which was unexpected by the developers). When this happens, all the 0.94.6 OSDs report the crc problem back to the mons, but the newer 0.94.9 OSDs don't.

Ceph users list discussion of the current issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html

The current theory is that downrev OSDs are continually pulling osdmaps from the upgraded mons.

Opening a downstream Bugzilla because an upgrade from 1.3.2 to RHCS 2.0 on large clusters may also be susceptible to this issue.

Version-Release number of selected component (if applicable):
1.3.2

Additional info:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg30783.html
Created attachment 1203865 [details] Same network spike with packets/sec also included.
Upstream tracker 17386 updated with the following details:

"It appears that starting with 0.94.7 that the osdmap encoding changed (which was unexpected by developers)"

The CRC mismatch warning is expected: pg_pool_t is a field in both OSDMap::Incremental and OSDMap itself. In 0.94.6, pg_pool_t is encoded with the v17 scheme, while in 0.94.9 this structure is encoded using v21. After the upgrade, the monitors encode the (incremental) osdmap using the new scheme, while an OSD running 0.94.6 still re-encodes the full osdmap using v17 and then compares the CRC of the re-encoded full map with the CRC of the original full map encoded using v21. That is why the CRCs mismatch.

In a large cluster, resending the full map can be a burden on the monitors and can saturate the cluster network.

We do have the machinery to re-encode the osdmap for old clients, but we need to do this explicitly, i.e. add CEPH_FEATURE_RESERVED (a non-existent feature bit) to the feature bits used to encode the MOSDMap message in OSDMonitor::send_incremental() before sending it down to the messenger, which will just put the pre-encoded incremental maps and full maps into the payload buffer. (Downside: larger memory footprint.)

Or, we can add an option to disable the CRC checking (or the full map request upon CRC mismatch) on the OSD side, so that it can be disabled at run time on seeing performance degradation due to this problem. (Downside: yet another knob.)
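To make the mismatch concrete, here is a minimal toy sketch in Python. It is not Ceph code: the encoder, the field names (`size`, `extra_field`) and the byte layout are hypothetical, and `zlib.crc32` stands in for whatever checksum the real osdmap uses. It only illustrates that the same logical data encoded under two struct versions produces different bytes, so a CRC computed over one encoding cannot be expected to match a re-encoding under the other.

```python
import struct
import zlib


def encode_pool(pool, version):
    """Toy encoder (hypothetical layout, not Ceph's pg_pool_t wire format).

    The first byte is the struct version. Versions >= 21 append one extra
    field that older encoders do not know how to write.
    """
    buf = struct.pack("<BI", version, pool["size"])
    if version >= 21:
        buf += struct.pack("<B", pool["extra_field"])
    return buf


# Hypothetical pool contents, identical on both sides.
pool = {"size": 3, "extra_field": 1}

# Upgraded monitor: encodes with the newer scheme and records the CRC.
mon_bytes = encode_pool(pool, 21)
mon_crc = zlib.crc32(mon_bytes)

# Downrev OSD: re-encodes the same data with the older scheme.
osd_bytes = encode_pool(pool, 17)
osd_crc = zlib.crc32(osd_bytes)

# The OSD compares its own CRC against the one the monitor sent.
crc_matches = osd_crc == mon_crc
print("CRC matches:", crc_matches)
```

In this toy model the comparison fails purely because of the encoding version, not because the map contents differ, which is the situation described above for 0.94.6 OSDs talking to 0.94.9 monitors.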
Please note: once the cluster upgrade has been completed with the hotfix monitor installed, the user can opt to roll back to the monitor without the fix, or just keep the hotfix version. The fix only kicks in if the peer does not have the GMT_HITSET feature bit.
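The gating condition in the last sentence can be sketched as a simple feature-mask test. This is an illustration only, in Python rather than the monitor's own code: the bit position below is a placeholder and not the real value of the GMT_HITSET feature bit, and `needs_reencode` is a hypothetical helper name.

```python
# Placeholder bit position; NOT the real value of the GMT_HITSET feature bit.
HYPOTHETICAL_GMT_HITSET = 1 << 5


def needs_reencode(peer_features: int) -> bool:
    """Return True when the peer lacks the (hypothetical) GMT_HITSET bit.

    In this sketch, only such peers would receive a map re-encoded in the
    older scheme; peers that advertise the bit get the map unchanged.
    """
    return (peer_features & HYPOTHETICAL_GMT_HITSET) == 0


old_osd_features = 0b00011                            # bit not set
new_osd_features = 0b00011 | HYPOTHETICAL_GMT_HITSET  # bit set

print(needs_reencode(old_osd_features))
print(needs_reencode(new_osd_features))
```

Under this reading, once every OSD advertises the feature bit the check is never true, which is consistent with the note that keeping the hotfix monitor after the upgrade is harmless.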