Bug 1330035

Summary: Even after removing the troublesome OSD, still seeing inconsistent PGs
Product: Red Hat Ceph Storage
Reporter: Tanay Ganguly <tganguly>
Component: RADOS
Assignee: Kefu Chai <kchai>
Status: CLOSED WONTFIX
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.0
CC: ceph-eng-bugs, dzafman, hnallurv, kchai, kdreyer, kurs
Target Milestone: rc
Target Release: 2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-10 07:32:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Tanay Ganguly 2016-04-25 10:27:58 UTC
Description of problem:
A deep scrub is needed to clear the inconsistent PG state.

Version-Release number of selected component (if applicable):
10.1.1.1

How reproducible:
Always

Steps to Reproduce:
1. Had a cluster with 15 PGs marked as inconsistent.
2. Identified that the problem was with one particular disk, which had gone bad.
3. Removed that particular OSD from the CRUSH map; data rebalancing took place, but those 15 PGs were still showing as inconsistent.
4. And if I query the PG:
rados list-inconsistent-obj 6.59
[]error 2: (2) No such file or directory

I was seeing the error because that particular OSD was no longer there.

Actual results:
Needed to run a deep scrub on those inconsistent PGs to make the cluster clean.
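For reference, the manual workaround can be scripted: parse the inconsistent PG ids out of `ceph health detail` and issue a deep scrub for each. A minimal sketch in Python (the health output below is a made-up sample in the typical Jewel-era format, and the `ceph pg deep-scrub` invocations are only printed, not executed):

```python
import re

# Sample `ceph health detail` output (assumed format, for illustration only).
health_detail = """\
pg 6.59 is active+clean+inconsistent, acting [3,5,7]
pg 6.1a is active+clean+inconsistent, acting [2,4,6]
"""

def inconsistent_pgs(health_text):
    """Extract the PG ids of all PGs reported as inconsistent."""
    return re.findall(r"pg (\S+) is \S*inconsistent", health_text)

# Print the deep-scrub command that would be run for each inconsistent PG.
for pg in inconsistent_pgs(health_detail):
    print("ceph pg deep-scrub " + pg)
```

In a real cluster the printed commands would be executed against the running daemons; here they are echoed so the extraction logic can be inspected safely.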

Expected results:
It should have been taken care automatically.

Additional info:

Comment 3 Kefu Chai 2016-05-10 07:32:42 UTC
this problem is twofold:

still marked inconsistent after removing the bad OSD
====================================================

we report the current status to the monitor after scrubbing, but we don't clear the PG_STATE_INCONSISTENT flag after peering. Since we don't track why or who caused the inconsistency, we can't revert the flag once the bad actor is gone; doing so would be very tricky. So the simpler and safer approach is to keep that flag until it is reset by a deep scrub, which is what set it in the first place.
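The flag lifecycle described above can be modelled with a toy state machine (hypothetical Python, not actual Ceph code; the flag value is made up): only a deep scrub sets or clears the inconsistent flag, and peering after the bad OSD is removed deliberately leaves it untouched.

```python
PG_STATE_INCONSISTENT = 1 << 0  # toy flag value, not Ceph's real constant

class PG:
    """Toy model of the PG flag lifecycle described in the comment."""
    def __init__(self):
        self.state = 0

    def deep_scrub(self, found_inconsistency):
        # Deep scrub is the only operation that sets or clears the flag.
        if found_inconsistency:
            self.state |= PG_STATE_INCONSISTENT
        else:
            self.state &= ~PG_STATE_INCONSISTENT

    def peer(self):
        # Peering (e.g. after the bad OSD is removed from CRUSH) does NOT
        # touch the flag: nothing tracks who caused the inconsistency.
        pass

pg = PG()
pg.deep_scrub(found_inconsistency=True)   # bad disk found
pg.peer()                                 # OSD removed, data rebalanced
assert pg.state & PG_STATE_INCONSISTENT   # still flagged, as in the report
pg.deep_scrub(found_inconsistency=False)  # manual deep scrub
assert not pg.state & PG_STATE_INCONSISTENT
```

This matches the reported behaviour: rebalancing alone never cleared the 15 inconsistent PGs; only a fresh deep scrub did.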



rados list-inconsistent-obj
===========================

"rados list-inconsistent-obj" targets the primary osd for getting the latest scrub result

- after the peering, the interval changed, so the object for storing the result of last scrub is zapped. that's why we have empty return value.
- and since the command does not send the epoch # as should the scrub script. we can hardly check if this inconsistency is outdated or not.
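The second point can be illustrated with a hedged sketch: if the client sent along the epoch at which the scrub result was produced, the primary could reject results from an older interval. The function and parameter names below are illustrative, not Ceph's actual API:

```python
def scrub_result_is_current(result_epoch, same_interval_since):
    """A scrub result produced before the current interval began is stale.

    result_epoch:        epoch at which the scrub result was recorded
    same_interval_since: epoch at which the PG's current interval started
    (both names are hypothetical, modelling the comment's reasoning)
    """
    return result_epoch >= same_interval_since

# Scrub ran at epoch 100; peering later started a new interval at epoch 120,
# so the old result should be treated as outdated.
assert not scrub_result_is_current(100, 120)
# A result from epoch 125 belongs to the current interval and is still valid.
assert scrub_result_is_current(125, 120)
```

Without the epoch being sent, no such comparison is possible, which is why the command cannot tell a stale inconsistency from a live one.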

Not a blocker - recommend moving to 2.z

Comment 4 Ken Dreyer (Red Hat) 2016-05-10 13:14:36 UTC
Kefu can you please confirm that you meant to close this one as NOTABUG? The previous comment says "recommend moving to 2.z", so I wanted to double-check this.

Comment 5 Kefu Chai 2016-05-11 06:29:34 UTC
Ken, yes, I confirm.

sorry for the confusion. I forgot to remove that line after editing the reasons to close this bug as NOTABUG.

Comment 6 Tanay Ganguly 2016-05-13 09:21:05 UTC
Hi Kefu,

I think it's a bug, but behaving as designed.
Should we mark it as NOTABUG?

Thanks,
Tanay

Comment 7 Kefu Chai 2016-05-25 08:13:17 UTC
Tanay, sorry for the delay. Makes sense. Changing it to WONTFIX.