Bug 1373653

Summary:

PG scrub bypasses 'osd_scrub_chunk_max' limit to find hash boundary

Product:

[Red Hat Storage] Red Hat Ceph Storage

Reporter:

Michael J. Kidd <linuxkidd>

Component:

RADOS

Assignee:

David Zafman <dzafman>

Status:

CLOSED WONTFIX

QA Contact:

ceph-qe-bugs <ceph-qe-bugs>

Severity:

medium

Docs Contact:

Priority:

high

Version:

1.3.2

CC:

ceph-eng-bugs, dzafman, icolle, kchai, kdreyer, linuxkidd, nlevine, sjust, tmuthami, tserlin, vumrao

Target Milestone:

Target Release:

1.3.4

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1382226, 1391070

Environment:

Last Closed:

2016-11-16 16:54:12 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1382226, 1391070

Attachments:

Description	Flags
OSD logs & configs for 134, 135 and 224	none

Description Michael J. Kidd 2016-09-06 21:38:03 UTC

Created attachment 1198425
OSD logs & configs for 134, 135 and 224

Description of problem:
PG scrub code requires a chunk it's scrubbing to begin and end on a hash boundary.
When it does not find a hash boundary within the chunk, it appears to keep expanding the object list (possibly by up to osd_scrub_chunk_max *PAST* the nearest hash boundary).
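
To illustrate the suspected behaviour, here is a minimal Python model. It is a sketch under assumptions, not the Ceph implementation: objects are represented only by a sorted list of hash values, and the names pick_chunk_end and chunk_sizes are hypothetical. A chunk may only end where two neighbouring objects have different hashes. If a batch of chunk_max objects contains no such point, another batch is listed and the chunk keeps growing.

```python
def pick_chunk_end(hashes, start, chunk_max):
    """Return the exclusive end index of a scrub chunk that begins at `start`.

    `hashes` is a sorted list of per-object hash values (a stand-in for the
    PG object listing; this is an illustrative model only).
    """
    n = len(hashes)
    cursor = start
    while cursor < n:
        batch_end = min(cursor + chunk_max, n)
        if batch_end == n:
            # The end of the listing is implicitly a boundary.
            return n
        # Search backward through this batch for a hash boundary, i.e. an
        # index i where the object before i has a different hash than i.
        for i in range(batch_end, cursor, -1):
            if hashes[i - 1] != hashes[i]:
                return i
        # No boundary in this batch: list another batch and keep expanding.
        cursor = batch_end
    return n


def chunk_sizes(hashes, chunk_max):
    """Split the whole listing into chunks and return the size of each."""
    sizes = []
    start = 0
    while start < len(hashes):
        end = pick_chunk_end(hashes, start, chunk_max)
        sizes.append(end - start)
        start = end
    return sizes


# Hypothetical listing: ten distinct hashes, then 40 objects that share one
# hash, then three more distinct hashes.
hashes = list(range(10)) + [10] * 40 + [11, 12, 13]
print(chunk_sizes(hashes, chunk_max=5))  # [5, 5, 40, 3]
```

In this model the third chunk holds 40 objects even though chunk_max is 5, because no boundary exists inside the run of identical hashes. Whether the 435-object chunk in this report has the same cause is not established here.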

Version-Release number of selected component (if applicable):
Ceph 0.94.3

How reproducible:
Sporadic. Data was captured during a deep scrub with 'osd_scrub_sleep' set to 2.

Steps to Reproduce:
1. Set 'osd_scrub_sleep' to a non-zero value
2. Set 'osd_scrub_chunk_max' to a low value (5 in this case)
3. Force a deep scrub on a PG with a few thousand objects in it
4. Monitor the log output for extended periods between 'sleep' periods
5. Analyze the object count between the extended 'sleep' periods
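
Steps 1 to 3 can be sketched as the following Python helper, which only builds and prints the command lines and does not run them. It assumes the standard ceph CLI ('ceph tell osd.N injectargs' and 'ceph pg deep-scrub') and uses the values from this report (primary OSD 134, PG 3.f09). The helper name repro_commands is hypothetical, and the exact commands used in the original capture are not recorded in this bug.

```python
import shlex


def repro_commands(osd_id, pgid, scrub_sleep, chunk_max):
    """Build argv lists for steps 1-3 (assumed commands, not executed here)."""
    inject = f"--osd_scrub_sleep {scrub_sleep} --osd_scrub_chunk_max {chunk_max}"
    return [
        # Steps 1 and 2: change the scrub settings on the running primary OSD.
        ["ceph", "tell", f"osd.{osd_id}", "injectargs", inject],
        # Step 3: force a deep scrub of the PG.
        ["ceph", "pg", "deep-scrub", pgid],
    ]


# Print shell-quoted command lines for review before running them by hand.
for argv in repro_commands(134, "3.f09", 2, 5):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Settings changed with injectargs apply only to the running daemon, so they would need to be injected again after an OSD restart.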

Actual results:
PG scrub proceeded correctly for several iterations (5 objects read, 2 second sleep, repeat) until 2016-09-06 09:59:51.772691, at which point 435 objects were read in a single chunk. That scrub chunk took more than 40 seconds, which caused slow requests to stack up and impacted client IO.

Expected results:
'osd_scrub_chunk_max' objects to be read at a time, with 'osd_scrub_sleep' seconds between chunks until all objects have been read/scrubbed
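
For comparison, a small Python calculation of what the expected behaviour would mean for the 435 objects that were read at once. This is a sketch that assumes every chunk is full and that exactly one sleep separates consecutive chunks. The function name expected_schedule is hypothetical.

```python
import math


def expected_schedule(total_objects, chunk_max, sleep_s):
    """Return (number of chunks, total seconds spent sleeping between them)."""
    chunks = math.ceil(total_objects / chunk_max)
    # One sleep between each pair of consecutive chunks.
    sleep_total = max(chunks - 1, 0) * sleep_s
    return chunks, sleep_total


chunks, sleep_total = expected_schedule(435, chunk_max=5, sleep_s=2)
print(chunks, sleep_total)  # 87 172
```

Under these assumptions the 435 objects would have been split into 87 chunks with 172 seconds of sleep between them, instead of one chunk lasting more than 40 seconds.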

Additional info:
* config dumps attached
* logs attached from the OSDs which were responsible for PG 3.f09 which was under deep scrub.
* OSD 134 was the primary and logged the sleep / slept statements
* debug_filestore 20, debug_osd 20, debug_ms 1 set during PG scrub

Comment 42 Vikhyat Umrao 2017-01-10 21:19:10 UTC

The fix will be released in RHCS 2.2; see bug https://bugzilla.redhat.com/show_bug.cgi?id=1382226