Bug 1401710

Summary: RHCS 1.3.1 - 0.94.3-3.el7cp - pgs stuck in remapped after recovery on cluster with many osds down
Product: Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Brad Hubbard <bhubbard>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.3.3
CC: ceph-eng-bugs, dzafman, jdurgin, kchai, sjust
Target Milestone: rc
Target Release: 2.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-02 23:16:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vikhyat Umrao 2016-12-05 22:35:18 UTC
Description of problem:

RHCS 1.3.1 - 0.94.3-3.el7cp - pgs stuck in remapped after recovery on cluster with many osds down

http://tracker.ceph.com/issues/18145

During a maintenance event in which many OSDs were removed and some existing OSDs were down, the cluster went into recovery. Once recovery had completed, ten PGs were left stuck in the "remapped" state; only a restart of the primary OSDs involved resolved the issue and allowed these PGs to complete peering successfully.

Note: instead of a restart, we tried marking the primary OSD down ($ ceph osd down <osd-id>); see the command sketch below.
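
For reference, a minimal sketch of that workaround as shell commands, assuming the Hammer-era ceph CLI; the PG id (3.1f) and OSD id (4) below are illustrative placeholders, not values from this cluster:

# Show stuck PGs and their states (stuck "remapped" PGs are reported as unclean).
$ ceph health detail
$ ceph pg dump_stuck unclean

# For a stuck PG, find its acting primary (the first OSD in the acting set
# reported by "ceph pg map").
$ ceph pg map 3.1f

# Mark that primary down so the PG re-peers; the daemon is not stopped and
# marks itself back up automatically. Restarting the primary OSD daemon has
# the same effect.
$ ceph osd down 4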



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.1 - 0.94.3-3.el7cp

Comment 5 Josh Durgin 2017-04-01 01:21:21 UTC
Included in 10.2.3 upstream.

Comment 6 Brad Hubbard 2017-04-01 02:27:03 UTC
(In reply to Josh Durgin from comment #5)
> Included in 10.2.3 upstream.

What is, Josh?

The root cause for this was never established, and I doubt it ever will be, so I suspect we can close this. We did an extensive code review to try to work out how this came about, but we were not able to come up with a viable theory.

I'll tidy up the upstream tracker and probably close it, and this, on Monday.

Comment 7 Brad Hubbard 2017-04-02 23:16:03 UTC
We have tried extensively to reproduce this issue, as well as performing an exhaustive code review, but have not been able to determine the cause. Closing this for now, but please feel free to reopen if more information comes to light.