Bug 1401710

Summary: RHCS 1.3.1 - 0.94.3-3.el7cp - pgs stuck in remapped after recovery on cluster with many osds down
Product: Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Brad Hubbard <bhubbard>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.3.3
CC: ceph-eng-bugs, dzafman, jdurgin, kchai, sjust
Target Milestone: rc
Target Release: 2.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-02 23:16:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vikhyat Umrao 2016-12-05 22:35:18 UTC
Description of problem:

RHCS 1.3.1 - 0.94.3-3.el7cp - pgs stuck in remapped after recovery on cluster with many osds down

http://tracker.ceph.com/issues/18145

During a maintenance event in which many OSDs were removed and some existing OSDs were down, the cluster went into recovery. Once recovery had completed, ten PGs were left stuck in the "remapped" state; only a restart of the primary OSDs involved resolved the issue and allowed these PGs to complete peering successfully.

Note: instead of a restart, we tried marking the primary OSD down ($ ceph osd down <osd-id>); see the command sketch below.
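
For reference, a minimal sketch of that workaround as shell commands, assuming the Hammer-era ceph CLI; the PG id (3.1f) and OSD id (4) below are illustrative placeholders, not values from this cluster:

# Show stuck PGs and their states (stuck "remapped" PGs are reported as unclean).
$ ceph health detail
$ ceph pg dump_stuck unclean

# For a stuck PG, find its acting primary (the first OSD in the acting set
# reported by "ceph pg map").
$ ceph pg map 3.1f

# Mark that primary down so the PG re-peers; the daemon is not stopped and
# marks itself back up automatically. Restarting the primary OSD daemon has
# the same effect.
$ ceph osd down 4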



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.1 - 0.94.3-3.el7cp

Comment 5 Josh Durgin 2017-04-01 01:21:21 UTC
Included in 10.2.3 upstream.

Comment 6 Brad Hubbard 2017-04-01 02:27:03 UTC
(In reply to Josh Durgin from comment #5)
> Included in 10.2.3 upstream.

What is, Josh?

The root cause for this was never established, and I doubt it ever will be, so I suspect we can close this. We did an extensive code review to try to work out how this came about, but we were not able to come up with a viable theory.

I'll tidy up the upstream tracker and probably close it, and this, on Monday.

Comment 7 Brad Hubbard 2017-04-02 23:16:03 UTC
We have tried extensively to reproduce this issue, as well as performing an exhaustive code review, but have not been able to determine the cause. Closing this for now, but please feel free to reopen if more information comes to light.