Bug 1311273 - After install, OSDs do not come up. They need to be restarted.
Status: NEW
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Installer
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 1.3.4
Assigned To: Alfredo Deza
Reported: 2016-02-23 14:18 EST by Warren
Modified: 2017-10-16 12:54 EDT

Doc Type: Bug Fix
Type: Bug

Description Warren 2016-02-23 14:18:03 EST
Description of problem:
After bringing up a ceph cluster, ceph health shows:

HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

On the OSD, ps shows no osd daemons are running.

In order to get out of this state, I needed to:
/etc/init.d/ceph stop
/etc/init.d/ceph start

Note: Running only the start command appeared to succeed, but it did not clear this state; a full stop followed by a start was required.
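The failed state above can be detected mechanically before deciding to restart. A minimal sketch, assuming the `ceph health` line has been captured into a shell variable; `is_no_osd_state` is a hypothetical helper, not anything shipped with Ceph:

```shell
# Hypothetical helper (not part of Ceph): report whether a captured
# `ceph health` line shows the "no osds" failure state described above.
# On a live monitor node the line would come from: health=$(sudo ceph health)
is_no_osd_state() {
    printf '%s\n' "$1" | grep -q 'no osds'
}

health='HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds'
if is_no_osd_state "$health"; then
    # This is where the stop/start workaround would run on each OSD node:
    #   /etc/init.d/ceph stop && /etc/init.d/ceph start
    echo "OSDs down; restart needed"
fi
```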

Version-Release number of selected component (if applicable):
1.2.3 for certain; I believe I have also seen this in 1.3.

How reproducible:
100% of the last few times that I tried this.

Steps to Reproduce:
1. Bring up a cluster using ceph-deploy.
2. On the monitor node, run sudo ceph health.
3. Note the HEALTH_ERR status.
4. On the OSD nodes, run /etc/init.d/ceph stop; /etc/init.d/ceph start
5. Watch the health of the Ceph cluster improve.

Actual results:
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

Expected results:
The OSD daemons come up automatically after install, and ceph health does not report the errors above.
Additional info:
Executing the ceph stop and start on the OSDs is a workaround for this problem.
Comment 2 Ben England 2017-07-28 07:21:31 EDT
I am elevating this to high priority. It is still happening on RHCS 2.3 on John Harrigan's BAGL cluster, and I have seen it in the scale lab at times as well. I should have filed a bug for it sooner. Is this part of QE regression testing?

This problem causes massive data movement at best and data unavailability at worst, and would disrupt any application running on the cluster.

My workaround is:

for d in /dev/sd[b-z] ; do ceph-disk activate ${d}1 ; done

What's weird is that some of the drives come up and some do not, which looks almost like a race condition. We'll try to gather more data for you, including an sosreport and some system status.
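If this really is a race, one hedged refinement of the loop above is to retry the per-drive activation instead of running it once. In this sketch `activate` is a simulated stand-in for `ceph-disk activate ${d}1` (it fails on the first call and succeeds on the second, mimicking drives that need a second attempt); on a real node you would substitute the actual command:

```shell
# Simulated stand-in for `ceph-disk activate ${d}1`: fails the first
# time it is called, succeeds the second, to mimic the apparent race.
count=0
activate() {
    count=$((count + 1))
    [ "$count" -ge 2 ]
}

# Retry activation up to 5 times before giving up on a device.
retry_activate() {
    dev=$1
    tries=0
    until activate "$dev"; do
        tries=$((tries + 1))
        if [ "$tries" -ge 5 ]; then
            echo "giving up on $dev" >&2
            return 1
        fi
    done
    echo "activated $dev"
}

retry_activate /dev/sdb1
```

On a real OSD node the outer loop would stay the same, e.g. `for d in /dev/sd[b-z] ; do retry_activate ${d}1 ; done`, with `activate` replaced by the real ceph-disk invocation.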

John, is this reproducible?  Cluster was installed with ceph-ansible.
Comment 4 Ben England 2017-07-28 13:28:50 EDT
this is the wrong bz to put this in, though it is remotely related (OSDs failing to start). John Harrigan is going to create a new bz for this problem. Lowering the priority of this bz. It was not clear to me whether the installation procedure above was missing a step that would have activated the OSDs, something like ceph-disk activate but for ceph-deploy.
