Bug 1506782

Summary: osd_scrub_auto_repair not working as expected
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: tbrekke
Component: RADOS    Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA QA Contact: Manohar Murthy <mmurthy>
Severity: medium Docs Contact: Aron Gunn <agunn>
Priority: medium    
Version: 2.3    CC: agunn, anharris, ceph-eng-bugs, dzafman, jbrier, jdurgin, jgalvez, kchai, mhackett, mmurthy, nojha, tbrekke, tchandra, tserlin, vumrao
Target Milestone: z2   
Target Release: 3.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.8-109.el7cp Ubuntu: ceph_12.2.8-94redhat1xenial Doc Type: Bug Fix
Doc Text:
.A PG repair no longer sets the storage cluster to a warning state
Previously, when repairing a placement group (PG), the PG was treated as damaged, which placed the storage cluster into a warning state. With this release, repairing a PG does not place the storage cluster into a warning state.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-30 15:56:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1629656    

Description tbrekke 2017-10-26 17:56:45 UTC
Description of problem:

When osd_scrub_auto_repair=true is set, inconsistent placement groups on erasure-coded pools should automatically get repaired. When testing this function, it seems every deep scrub triggers a repair even if no errors are reported on the PG. This is not ideal, and it causes the cluster to go into a warn state, which can throw alerts in customers' monitoring systems.

Version-Release number of selected component (if applicable):

RHCS 2.3

How reproducible:

Every time

Steps to Reproduce:
1. ceph tell osd.* injectargs '--osd_scrub_auto_repair=true'
2. Force a deep scrub, or wait for the next scheduled deep scrub

Actual results:

PG will be in repair state

Expected results:

PG should only be in repair state if there is an error.

Additional info:

It looks like the repair is cancelled when the number of errors is greater than osd_scrub_auto_repair_num_errors, but shouldn't it also be cancelled if the number of errors is zero?

https://github.com/ceph/ceph/blob/jewel/src/osd/PG.cc#L4686
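The gate in question can be sketched as follows. This is a hypothetical simplification for illustration only: the function names, signatures, and parameters are not Ceph's actual code in PG.cc, which should be consulted at the link above.

```cpp
#include <cstdint>

// Hypothetical simplification of the auto-repair gate the reporter points
// at in PG.cc; names and signatures are illustrative, not Ceph's API.

// Reported behavior: a repair is scheduled whenever the scrub error count
// does not exceed osd_scrub_auto_repair_num_errors -- which includes a
// clean deep scrub that found zero errors.
bool should_auto_repair_reported(uint64_t scrub_errors, uint64_t max_errors) {
  return scrub_errors <= max_errors;
}

// Suggested behavior: additionally require at least one error before
// putting the PG into the repair state, so a clean deep scrub never
// pushes the cluster into HEALTH_WARN.
bool should_auto_repair_suggested(uint64_t scrub_errors, uint64_t max_errors) {
  return scrub_errors > 0 && scrub_errors <= max_errors;
}
```

With the suggested check, a deep scrub that reports zero errors would skip the repair path entirely, while genuinely inconsistent PGs within the error threshold would still be auto-repaired.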

Comment 23 errata-xmlrpc 2019-04-30 15:56:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911