| Summary: | PG scrub bypasses 'osd_scrub_chunk_max' limit to find hash boundary | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Michael J. Kidd <linuxkidd> | ||||
| Component: | RADOS | Assignee: | David Zafman <dzafman> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 1.3.2 | CC: | ceph-eng-bugs, dzafman, icolle, kchai, kdreyer, linuxkidd, nlevine, sjust, tmuthami, tserlin, vumrao | ||||
| Target Milestone: | rc | ||||||
| Target Release: | 1.3.4 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1382226 1391070 | Environment: | |||||
| Last Closed: | 2016-11-16 16:54:12 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1382226, 1391070 | ||||||
| Attachments: |
The fix will be released in RHCS 2.2; see bug: https://bugzilla.redhat.com/show_bug.cgi?id=1382226
Created attachment 1198425
OSD logs & configs for 134, 135 and 224

Description of problem:
The PG scrub code requires that the chunk it is scrubbing begin and end on a hash boundary. It seems that when it does not find a hash boundary, it keeps expanding the list (possibly by up to osd_scrub_chunk_max *PAST* the nearest hash boundary).

Version-Release number of selected component (if applicable):
Ceph 0.94.3

How reproducible:
Sporadic. Data was captured during a deep scrub with 'osd_scrub_sleep' set to 2.

Steps to Reproduce:
1. Set 'osd_scrub_sleep' to a non-zero value
2. Set 'osd_scrub_chunk_max' to a low value (5 in this case)
3. Force a deep scrub on a PG with a few thousand objects in it
4. Monitor the log output for extended periods between 'sleep' periods
5. Analyze the object count between extended 'sleep' periods

Actual results:
PG scrub proceeded correctly for several iterations (5 objects read, 2 second sleep, repeat) until 2016-09-06 09:59:51.772691, at which point 435 objects were read in and the scrub chunk took > 40 seconds, which caused slow requests to stack up and impacted client IO.

Expected results:
'osd_scrub_chunk_max' objects are read at a time, with 'osd_scrub_sleep' seconds between chunks, until all objects have been read/scrubbed.

Additional info:
* Config dumps are attached.
* Logs are attached from the OSDs which were responsible for PG 3.f09, which was under deep scrub.
* OSD 134 was primary and logged the sleep / slept statements.
* debug_filestore 20, debug_osd 20, debug_ms 1 were set during the PG scrub.
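
To illustrate the suspected behavior, here is a minimal Python sketch of a chunk selector that respects a maximum size but refuses to end a chunk in the middle of a run of objects sharing the same hash. This is a hypothetical model, not the actual Ceph scrub code: the function names (select_chunk, chunk_sizes) and the synthetic hash list are assumptions made for illustration, and the run length of 435 was chosen only to mirror the count seen in this report, not derived from the attached logs.

```python
from typing import List


def select_chunk(hashes: List[int], start: int, chunk_max: int) -> int:
    """Return the exclusive end index of the next scrub chunk.

    Hypothetical model: take up to chunk_max objects, then keep
    extending while the cut would split a run of equal hashes.
    """
    end = min(start + chunk_max, len(hashes))
    # The chunk may only end on a hash boundary, so grow past chunk_max
    # for as long as the next object shares the last object's hash.
    while end < len(hashes) and hashes[end] == hashes[end - 1]:
        end += 1
    return end


def chunk_sizes(hashes: List[int], chunk_max: int) -> List[int]:
    """Split the whole object list into chunks and return their sizes."""
    sizes = []
    start = 0
    while start < len(hashes):
        end = select_chunk(hashes, start, chunk_max)
        sizes.append(end - start)
        start = end
    return sizes


# Synthetic PG contents (assumed): 20 objects with distinct hashes,
# 435 objects sharing one hash, then 10 more distinct hashes.
hashes = list(range(20)) + [1000] * 435 + list(range(2000, 2010))
sizes = chunk_sizes(hashes, chunk_max=5)
print(sizes)  # [5, 5, 5, 5, 435, 5, 5]
```

In this model the chunk size stays at 'osd_scrub_chunk_max' until the selector hits a long run with no hash boundary, at which point a single chunk absorbs the entire run. That matches the pattern in the actual results (several 5-object iterations followed by one very large chunk), but whether the real code path behaves this way would need to be confirmed against the OSD logs.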