Bug 1378549

Summary: RHCS 1.3: Upgrading 0.94.6 -> 0.94.9 saturating mon node networking
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Mike Hackett <mhackett>
Component: RADOS Assignee: Kefu Chai <kchai>
Status: CLOSED WONTFIX QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 1.3.2 CC: ceph-eng-bugs, dzafman, kchai, kdreyer, mhackett, sweil, vikumar, vumrao
Target Milestone: rc   
Target Release: 1.3.4   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1379027 (view as bug list) Environment:
Last Closed: 2018-01-31 04:18:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
Network spike on dnvrco01-cephmon-001 during upgrade | none
Same network spike with packets/sec also included. | none

Description Mike Hackett 2016-09-22 17:38:02 UTC
Created attachment 1203864 [details]
Network spike on dnvrco01-cephmon-001 during upgrade

Description of problem:
When upgrading a Ceph cluster from 0.94.6 to 0.94.9, a serious performance issue is seen every time an OSD is restarted in large clusters.

The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which results in thousands of slow requests.

Initially monitor logs were flooded with the following messages:

2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc

After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated when OSDs are restarted.

The above issue is discussed in the following community thread:
http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released

It appears that, starting with 0.94.7, the osdmap encoding changed (which was unexpected by developers). When this happens, all the 0.94.6 OSDs report the CRC problem back to the mons, but the newer 0.94.9 OSDs don't.

Ceph users list discussion of the current issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html

The current theory is that downrev OSDs are continually pulling osdmaps from the upgraded mons.

- Opening a downstream Bugzilla, as it appears an upgrade from 1.3.2 to RHCS 2.0 on large clusters may also be susceptible to this issue.

Version-Release number of selected component (if applicable):
1.3.2


Additional info:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg30783.html

Comment 3 Mike Hackett 2016-09-22 17:38:42 UTC
Created attachment 1203865 [details]
Same network spike with packets/sec also included.

Comment 7 Mike Hackett 2016-09-23 14:50:34 UTC
Upstream tracker 17386 updated with the following details:

    "It appears that, starting with 0.94.7, the osdmap encoding changed (which was unexpected by developers)"

The CRC mismatch warning is expected:

pg_pool_t is a field in both OSDMap::Incremental and OSDMap itself. In 0.94.6, pg_pool_t is encoded with the v17 scheme, while in 0.94.9 it is encoded with v21. After the upgrade, the monitors encode the (incremental) osdmap using the new scheme, while an OSD running 0.94.6 still re-encodes the full osdmap using v17 and then compares the CRC of its re-encoded full map with the CRC of the original full map, which was encoded using v21. That is why the CRCs mismatch.
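The mismatch described above can be illustrated with a minimal sketch. The byte layout, the field names, and the use of `zlib.crc32` are all assumptions made for illustration; this is not Ceph's actual pg_pool_t encoding or the checksum routine Ceph uses. It only shows that re-encoding the same logical data with an older scheme produces different bytes, and therefore a different CRC than the one the sender computed.

```python
import struct
import zlib


def encode_pool(pool, version):
    """Toy encoder: a version byte plus the fields that version knows about."""
    payload = struct.pack("<BII", version, pool["size"], pool["pg_num"])
    if version >= 21:
        # Hypothetical extra field carried only by the newer scheme.
        payload += struct.pack("<B", pool["use_gmt_hitset"])
    return payload


def osd_crc_matches(pool, osd_version, expected_crc):
    """Re-encode with the OSD's own scheme and compare against the sender's CRC."""
    return zlib.crc32(encode_pool(pool, osd_version)) == expected_crc


pool = {"size": 3, "pg_num": 4096, "use_gmt_hitset": 1}

# The upgraded monitor encodes with v21 and ships the CRC of those bytes.
mon_bytes = encode_pool(pool, 21)
expected_crc = zlib.crc32(mon_bytes)

old_osd_ok = osd_crc_matches(pool, 17, expected_crc)  # downrev OSD re-encodes as v17
new_osd_ok = osd_crc_matches(pool, 21, expected_crc)  # upgraded OSD re-encodes as v21
```

In this toy model the downrev OSD's check fails while the upgraded OSD's check passes, which mirrors why only the 0.94.6 OSDs log the warning.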

In a large cluster, resending the full map can burden the monitors and saturate the cluster network. Possible options:

    We do have the machinery to re-encode the osdmap for old clients, but we need to do this explicitly, i.e.
        add CEPH_FEATURE_RESERVED (the non-existent feature bit) to the feature bits
        encode the MOSDMap message in OSDMonitor::send_incremental() before handing it to the messenger, which will then just put the pre-encoded incremental maps and full maps into the payload buffer. (downside: larger memory footprint)
    Or, we can add an option to disable the CRC check (or the full-map request upon CRC mismatch) on the OSD side, so it can be turned off at run-time when performance degradation due to this problem is seen. (downside: yet another knob)

Comment 28 Kefu Chai 2016-09-30 03:16:03 UTC
Please note: once the cluster upgrade is complete after installing the monitor with the hotfix, the user can either roll back to the monitor without the fix or simply keep the hotfix version. The fix only kicks in if the peer does not have the GMT_HITSET feature bit.
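The feature-bit gating mentioned above could look roughly like the following sketch. It assumes the hotfix follows the re-encoding approach discussed in comment 7, which this report does not confirm; the bit position and the function name are hypothetical and are not taken from the Ceph source.

```python
# Hypothetical bit position, for illustration only; not Ceph's real value.
FEATURE_GMT_HITSET = 1 << 5

OLD_POOL_ENCODING = 17  # pg_pool_t scheme used by 0.94.6, per comment 7
NEW_POOL_ENCODING = 21  # pg_pool_t scheme used by 0.94.9, per comment 7


def pool_encoding_for_peer(peer_features):
    """Pick the encoding a peer can reproduce, based on its feature bits."""
    if peer_features & FEATURE_GMT_HITSET:
        # Upgraded peer: the new encoding round-trips, no special handling.
        return NEW_POOL_ENCODING
    # Downrev peer: the fix "kicks in" and the old scheme is used.
    return OLD_POOL_ENCODING


upgraded_osd = FEATURE_GMT_HITSET | 0b1  # advertises the bit plus another feature
downrev_osd = 0b1                        # lacks the bit
```

Once every OSD advertises the bit, the downrev branch is never taken, which is consistent with the note that the hotfix monitor can be kept or rolled back after the upgrade completes.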