Bug 1506782

Summary: osd_scrub_auto_repair not working as expected
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: tbrekke
Component: RADOS    Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA QA Contact: Manohar Murthy <mmurthy>
Severity: medium Docs Contact: Aron Gunn <agunn>
Priority: medium    
Version: 2.3    CC: agunn, anharris, ceph-eng-bugs, dzafman, jbrier, jdurgin, jgalvez, kchai, mhackett, mmurthy, nojha, tbrekke, tchandra, tserlin, vumrao
Target Milestone: z2   
Target Release: 3.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.8-109.el7cp Ubuntu: ceph_12.2.8-94redhat1xenial Doc Type: Bug Fix
Doc Text:
.A PG repair no longer sets the storage cluster to a warning state
Previously, when repairing a placement group (PG), the PG was treated as damaged, which placed the storage cluster into a warning state. With this release, repairing a PG does not place the storage cluster into a warning state.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-30 15:56:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1629656    

Description tbrekke 2017-10-26 17:56:45 UTC
Description of problem:

When osd_scrub_auto_repair=true is set, inconsistent placement groups on erasure-coded pools should automatically get repaired. When testing this function, it seems every deep scrub triggers a repair even if no errors are reported on the PG. This is not ideal, and it causes the cluster to go into a warn state, which can throw alerts in customers' monitoring systems.

Version-Release number of selected component (if applicable):

RHCS 2.3

How reproducible:

Every time

Steps to Reproduce:
1. ceph tell osd.* injectargs '--osd_scrub_auto_repair=true'
2. Force a deep scrub, or wait for the next scheduled deep scrub

Actual results:

PG will be in repair state

Expected results:

PG should only be in repair state if there is an error.

Additional info:

It looks like the repair is cancelled when the number of errors is greater than osd_scrub_auto_repair_num_errors, but shouldn't it also be cancelled if the number of errors is zero?

https://github.com/ceph/ceph/blob/jewel/src/osd/PG.cc#L4686
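The gate in question can be sketched as follows. This is a hypothetical simplification for illustration only: the function names, signatures, and parameters are not Ceph's actual code in PG.cc, which should be consulted at the link above.

```cpp
#include <cstdint>

// Hypothetical simplification of the auto-repair gate the reporter points
// at in PG.cc; names and signatures are illustrative, not Ceph's API.

// Reported behavior: a repair is scheduled whenever the scrub error count
// does not exceed osd_scrub_auto_repair_num_errors -- which includes a
// clean deep scrub that found zero errors.
bool should_auto_repair_reported(uint64_t scrub_errors, uint64_t max_errors) {
  return scrub_errors <= max_errors;
}

// Suggested behavior: additionally require at least one error before
// putting the PG into the repair state, so a clean deep scrub never
// pushes the cluster into HEALTH_WARN.
bool should_auto_repair_suggested(uint64_t scrub_errors, uint64_t max_errors) {
  return scrub_errors > 0 && scrub_errors <= max_errors;
}
```

With the suggested check, a deep scrub that reports zero errors would skip the repair path entirely, while genuinely inconsistent PGs within the error threshold would still be auto-repaired.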

Comment 23 errata-xmlrpc 2019-04-30 15:56:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911