Bug 1411496 - [support][Escalation] Need additional information on how 'ceph pg repair' functions, what pg states are safe to repair
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.3
Hardware: All
OS: Linux
Target Milestone: rc
Target Release: 1.3.4
Assignee: David Zafman
QA Contact: ceph-qe-bugs
Depends On:
Reported: 2017-01-09 20:42 UTC by Kyle Squizzato
Modified: 2020-05-14 15:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-04-17 22:11:09 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 1589113 | 0 | None | None | None | 2017-01-09 21:05:44 UTC

Description Kyle Squizzato 2017-01-09 20:42:02 UTC
Description of problem:
The customer has two questions about pg repairs:

1) The customer discovered an inconsistent pg and issued a 'ceph pg repair', but the repair did not appear to start for ~11 hours. The customer is looking for information on how the repair is scheduled.

2) The customer is interested in learning which pg states/conditions are safe to repair and which are not.  They were wondering if we could provide a list of these conditions so they could document them.
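
As a starting point for documenting pg states, the pgs currently flagged inconsistent can be listed by their full state string. The following is a minimal sketch, assuming 'ceph health detail' prints lines of the form "pg <pgid> is <states>, acting [...]". The sample text is illustrative and not taken from the customer's cluster, and find_inconsistent_pgs is a hypothetical helper, not part of Ceph.

```python
import re

# Illustrative sample only; not output from the customer's cluster.
SAMPLE = """\
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
"""

# Assumed line format: "pg <pgid> is <state+state+...>, acting [osd,osd,...]"
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]*)\]")


def find_inconsistent_pgs(text):
    """Return (pgid, states, acting_osds) for every pg flagged inconsistent."""
    found = []
    for line in text.splitlines():
        match = PG_LINE.match(line.strip())
        if not match:
            continue  # summary lines and anything else we do not recognize
        pgid, states, acting = match.groups()
        state_list = states.split("+")
        if "inconsistent" in state_list:
            osds = [int(osd) for osd in acting.split(",") if osd]
            found.append((pgid, state_list, osds))
    return found


if __name__ == "__main__":
    for pgid, states, osds in find_inconsistent_pgs(SAMPLE):
        print(pgid, "+".join(states), osds)
```

The other states reported alongside "inconsistent" (for example "active+clean") are what question 2 asks about; this sketch only extracts them and makes no claim about which combinations are safe to repair.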

Version-Release number of selected component (if applicable):

How reproducible:

Additional info:
pg repair logs have been requested but have not yet been received.

Comment 4 Kyle Squizzato 2017-01-27 19:46:57 UTC
Further question: What happens if the inconsistent copy of the object is actually the primary copy and a client attempts to read the object? Does Ceph automatically promote a different copy to primary, or will this result in a read I/O error? We have not experienced this (yet) since our workload at the time of this issue was only writes, but we'd like to know how to handle the read scenario and what type of error we can expect in the application, if any.


From my understanding, if the primary copy of the object is bad and 'pg repair' is called, Ceph will replicate the primary's copy to the secondary and tertiary OSD nodes; it is not intelligent in any way (unless this has been resolved in a later commit).

Regarding the question above, I imagine the client would either get an I/O error, if the object was corrupted enough, or read the object with some level of success.

Is this true?
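
The concern in this comment can be made concrete with a toy model. The sketch below is not Ceph code and does not claim to describe what 'ceph pg repair' actually does; it only encodes the behavior described above (the primary's copy overwrites the replicas) as an assumption, next to a hypothetical majority-vote strategy for contrast. The names repair_primary_wins and repair_majority are made up for illustration.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Checksum used to compare copies of an object."""
    return hashlib.sha256(data).hexdigest()


def repair_primary_wins(copies):
    """Assumed behavior from the comment: copy index 0 (the primary) to all replicas."""
    return [copies[0]] * len(copies)


def repair_majority(copies):
    """Hypothetical alternative: the copy whose digest is most common wins."""
    counts = Counter(digest(c) for c in copies)
    winner, _ = counts.most_common(1)[0]
    good = next(c for c in copies if digest(c) == winner)
    return [good] * len(copies)


if __name__ == "__main__":
    good, bad = b"object-v1", b"object-v1-corrupt"
    copies = [bad, good, good]  # the primary holds the corrupted copy
    print("primary wins:", repair_primary_wins(copies))
    print("majority:    ", repair_majority(copies))
```

Under the primary-wins assumption, a corrupted primary overwrites both good replicas, which is exactly the scenario being asked about.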
