Bug 1378549
| Summary: | RHCS 1.3: Upgrading 0.94.6 -> 0.94.9 saturating mon node networking | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Mike Hackett <mhackett> |
| Component: | RADOS | Assignee: | Kefu Chai <kchai> |
| Status: | CLOSED WONTFIX | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.3.2 | CC: | ceph-eng-bugs, dzafman, kchai, kdreyer, mhackett, sweil, vikumar, vumrao |
| Target Milestone: | rc | | |
| Target Release: | 1.3.4 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1379027 | Environment: | |
| Last Closed: | 2018-01-31 04:18:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Mike Hackett
2016-09-22 17:38:02 UTC
Created attachment 1203865
Same network spike with packets/sec also included.
Upstream tracker 17386 was updated with the following details:
"It appears that starting with 0.94.7 the osdmap encoding changed (which was unexpected by developers)."
The CRC mismatch warning is expected:
pg_pool_t is a field in both OSDMap::Incremental and OSDMap itself. In 0.94.6, pg_pool_t is encoded with the v17 scheme, while in 0.94.9 it is encoded with v21. After the upgrade, the monitors encode the (incremental) osdmap using the new scheme, while an OSD still running 0.94.6 re-encodes the full osdmap using v17 and then compares the CRC of its re-encoded full map with the CRC of the original full map, which was encoded using v21. That is why the CRCs mismatch.
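To illustrate why the CRCs cannot agree, here is a minimal Python sketch. It is a toy model, not Ceph code: the field layout, the `encode_pool` helper, the `extra_field` name, and the use of `zlib.crc32` (rather than whatever checksum the OSD really uses) are all assumptions made for illustration. The only point it demonstrates is that the same logical structure serialized under two encoding versions yields different bytes, and therefore different checksums.

```python
import struct
import zlib


def encode_pool(pool: dict, version: int) -> bytes:
    """Toy encoder (hypothetical layout): version byte, then two fixed fields.

    Encodings at version 21 or later append one extra field that the
    older scheme does not know about.
    """
    buf = struct.pack("<BIQ", version, pool["size"], pool["pg_num"])
    if version >= 21:
        # Hypothetical field that exists only in the newer encoding.
        buf += struct.pack("<B", pool["extra_field"])
    return buf


pool = {"size": 3, "pg_num": 128, "extra_field": 1}

# Upgraded monitor: encodes with the new scheme and records the CRC.
mon_bytes = encode_pool(pool, 21)
mon_crc = zlib.crc32(mon_bytes)

# Old OSD: re-encodes the same logical map with the old scheme.
osd_bytes = encode_pool(pool, 17)
osd_crc = zlib.crc32(osd_bytes)

# The logical content is identical, but the byte streams are not,
# so the old OSD sees a CRC mismatch.
crc_mismatch = osd_crc != mon_crc
print("mismatch:", crc_mismatch)
```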
In a large cluster, every such mismatch makes the monitor resend the full map, which can burden the monitors and saturate the cluster network. Maybe we can have
We do have the machinery to re-encode the osdmap for old clients, but we need to do this explicitly, i.e.:
1. Add CEPH_FEATURE_RESERVED (the nonexistent feature bit) to the feature bits.
2. Encode the MOSDMap message in OSDMonitor::send_incremental() before sending it down to the messenger, which will then just put the pre-encoded incremental maps and full maps into the payload buffer. (Downside: larger memory footprint.)
Or we can add an option on the OSD side to disable the CRC check (or the full-map request upon CRC mismatch), so that it can be turned off at run time when performance degrades because of this problem. (Downside: yet another knob.)
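The OSD-side alternative can be sketched as follows. This is a hypothetical model only: the `check_reencoded_map` function, the `verify_crc` flag, and the returned action strings are invented for illustration and do not correspond to any actual Ceph option or code path, and `zlib.crc32` stands in for the real checksum.

```python
import zlib


def check_reencoded_map(reencoded: bytes, expected_crc: int,
                        verify_crc: bool = True) -> str:
    """Return the action a toy OSD takes after re-encoding a full map.

    verify_crc is the hypothetical run-time knob: when it is False the
    comparison is skipped and no full map is requested from the monitor.
    """
    if not verify_crc:
        return "accept"
    if zlib.crc32(reencoded) == expected_crc:
        return "accept"
    return "request_full_map"


payload = b"toy-osdmap"
good_crc = zlib.crc32(payload)
bad_crc = good_crc ^ 1  # simulate the monitor's CRC from a different encoding

print(check_reencoded_map(payload, good_crc))
print(check_reencoded_map(payload, bad_crc))
print(check_reencoded_map(payload, bad_crc, verify_crc=False))
```

With the knob off, the mismatch no longer triggers a full-map request, which is the trade-off named above: less monitor traffic at the cost of one more setting to manage.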
Please note: once the cluster upgrade is complete after installing the monitor with the hotfix, the user can either roll back to the monitor without the fix or keep the hotfix version. The fix only kicks in if the peer does not have the GMT_HITSET feature bit.
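A rough sketch of that feature-bit gating, in Python for brevity. The bit position of `FEATURE_GMT_HITSET`, the `pick_encoding_version` helper, and the mapping of "peer lacks the bit" to the v17 scheme are assumptions for illustration; the actual hotfix is not shown in this report.

```python
# Hypothetical bit position; the real feature bit value is not given here.
FEATURE_GMT_HITSET = 1 << 0


def pick_encoding_version(peer_features: int) -> int:
    """Choose the toy pg_pool_t encoding version for a peer.

    Peers that advertise GMT_HITSET get the new scheme (v21); peers
    without it get the old scheme (v17), so their re-encoded CRC matches.
    """
    if peer_features & FEATURE_GMT_HITSET:
        return 21
    return 17


old_osd_features = 0
new_osd_features = FEATURE_GMT_HITSET

print(pick_encoding_version(old_osd_features))
print(pick_encoding_version(new_osd_features))
```

Once every peer advertises the bit, the old-scheme branch is never taken, which is consistent with the note that the hotfix can be kept or rolled back after the upgrade completes.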