Bug 1401710 - RHCS 1.3.1 (0.94.3-3.el7cp) - PGs stuck in remapped after recovery on cluster with many OSDs down
Summary: RHCS 1.3.1 (0.94.3-3.el7cp) - PGs stuck in remapped after recovery on cluster with many OSDs down
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 2.3
Assignee: Brad Hubbard
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-05 22:35 UTC by Vikhyat Umrao
Modified: 2020-02-14 18:15 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-02 23:16:03 UTC
Target Upstream Version:


Attachments


Links
Ceph Project Bug Tracker 18145 (http://tracker.ceph.com/issues/18145) - Last Updated: 2016-12-05 22:36:36 UTC

Description Vikhyat Umrao 2016-12-05 22:35:18 UTC
Description of problem:

RHCS 1.3.1 (0.94.3-3.el7cp) - PGs stuck in remapped after recovery on cluster with many OSDs down

http://tracker.ceph.com/issues/18145

During a maintenance event in which many OSDs were removed and some existing OSDs were down, the cluster went into recovery. Once recovery had completed, ten PGs were left stuck in the "remapped" state. Only a restart of the primary OSDs involved resolved the issue and allowed these PGs to complete peering successfully.

Note: instead of a restart, we tried marking the primary as down ($ ceph osd down).
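
For reference, a minimal sketch of this kind of sequence on a Hammer-era cluster (the PG ID 2.17 and OSD ID 4 below are placeholders, not values from this cluster):

# List PGs stuck in an unclean/remapped state:
$ ceph health detail
$ ceph pg dump_stuck unclean

# Find the up/acting sets for a stuck PG; the first OSD in the
# acting set is the primary (2.17 is a placeholder PG ID):
$ ceph pg map 2.17

# Mark the primary down without restarting the daemon; it rejoins
# on its own and the PG re-peers (4 is a placeholder OSD ID):
$ ceph osd down 4

# Verify the PG has finished peering:
$ ceph pg 2.17 query | grep '"state"'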



Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.1 (0.94.3-3.el7cp)

Comment 5 Josh Durgin 2017-04-01 01:21:21 UTC
Included in 10.2.3 upstream.

Comment 6 Brad Hubbard 2017-04-01 02:27:03 UTC
(In reply to Josh Durgin from comment #5)
> Included in 10.2.3 upstream.

What is included in 10.2.3, Josh?

The root cause for this was never established, and I doubt it ever will be, so I suspect we can close this. We have done an extensive code review to try to work out how this came about, but we were not able to come up with a viable theory.

I'll tidy up the upstream tracker and probably close it, and this, on Monday.

Comment 7 Brad Hubbard 2017-04-02 23:16:03 UTC
We have tried extensively to reproduce this issue, and have done an exhaustive code review, but have not been able to determine the cause. Closing this for now, but please feel free to reopen if more information comes to light.

