Description of problem:
After deploying the Overcloud with 3 Ceph nodes (export CEPHSTORAGESCALE=3) and then removing one of the Ceph nodes, it is no longer possible to create cinder volumes (even though the Ceph pool min size is set to 1).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. export CEPHSTORAGESCALE=3;instack-deploy-overcloud --tuskar
2. nova stop "one-of-ceph-nodes"
3. wait for 5 minutes, then try "cinder create 1"
Actual results:
cinder volume status is "error"
Expected results:
volume is created
It seems that the issue is with Ceph itself. After a node is removed, Ceph properly reports that one node is down, but after 5 minutes "ceph df" reports 0 for "MAX AVAIL".
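For anyone re-testing, here is a minimal sketch of how the "MAX AVAIL" symptom could be checked from a script. It assumes the JSON form of the report (ceph df --format json) lists pools under "pools", each with a "stats" object containing "max_avail"; those key names may differ between Ceph releases, and the function name is hypothetical.

```python
import json


def pools_with_zero_max_avail(df_json):
    """Return the names of pools whose reported max_avail is exactly 0.

    df_json is the text printed by `ceph df --format json`.
    Assumption: pools appear under "pools", each with a "stats"
    object holding a "max_avail" value. Pools without that key
    are skipped rather than flagged.
    """
    report = json.loads(df_json)
    flagged = []
    for pool in report.get("pools", []):
        stats = pool.get("stats", {})
        if stats.get("max_avail") == 0:
            flagged.append(pool.get("name", "<unnamed>"))
    return flagged
```

The function could be fed with subprocess.check_output(["ceph", "df", "--format", "json"]) on a node that has the ceph CLI and an admin keyring. A non-empty result some minutes after stopping a node would match the behaviour described above.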
Setting high priority because this leaves the cinder service non-functional in rdo-director when a single node is removed.
It seems that it might be caused by this bug: http://tracker.ceph.com/issues/10257
As per the initial report, this seems to be caused by http://tracker.ceph.com/issues/10257, which is not fixed in ceph-0.80.7-0.4.el7.x86_64.
It seems to affect ceph-0.80.8-4.el7cp.x86_64 as well.
Once an updated ceph package is available, please re-test.
Just checking here... I see you're using EPEL packages in the initial bug report. We're not backporting any patches to EPEL (I guess we *could* do it.) Do you need a fix in EPEL too?
Correct, the initial report was mistakenly filed against Red Hat Storage despite using an RPM taken from EPEL. In comment #5 I confirmed that the same bug affects the RHS build: ceph-0.80.8-4.el7cp.x86_64
I might have misunderstood your comment #7; from bug #1225081 I understand this is fixed in RHS 1.3.x.
Yet RDO will use the version in EPEL. Should we file another bug, against RDO, to track a fix for EPEL?
The fix that you need (https://github.com/ceph/ceph/pull/3826) is in Ceph itself, right? I'm fine with just cherry-picking that fix to the EPEL 7 Ceph package.
I think cherry-picking to EPEL would be great. We can track the cherry-pick with a BZ as well if you want; from what I understand, EPEL issues should be filed using .
I can confirm that this issue is solved in ceph-0.94.1-11.el7cp.x86_64. With this version, "ceph df" returns reasonable values after removing an OSD node, and the "cinder create" command works.
This BZ does not need changes in the OSP Director; it is instead due to a bug in Ceph.
ceph-0.94.1-11.el7cp.x86_64 (1.3) includes the needed fixes.
ceph-0.80.7-0.4.el7.x86_64 (1.2) does not and will exhibit this problem.
BZ 1225081 tracks the backport of the fix from Ceph 1.3 to 1.2.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.