Bug 1311273

Summary: After install, OSDs do not come up. They need to be restarted.
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Disk
Version: 1.2.3
Reporter: Warren <wusui>
Assignee: Kefu Chai <kchai>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: adeza, anharris, bengland, ceph-eng-bugs, jharriga, kdreyer, mchristi
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: 1.3.4
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-02-20 20:48:42 UTC

Description Warren 2016-02-23 19:18:03 UTC
Description of problem:
After bringing up a ceph cluster, ceph health shows:

HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

On the OSD nodes, ps shows that no OSD daemons are running.

In order to get out of this state, I needed to:
/etc/init.d/ceph stop
/etc/init.d/ceph start

Note: Running only /etc/init.d/ceph start appeared to succeed, but it did not clear the error state; I had to stop and then start.
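
For reference, the full check-and-restart sequence I use on each OSD node looks roughly like the following (the sysvinit script path matches 1.2.x/1.3.x installs; the ps check and final health check are just illustrative):

# confirm no OSD daemons are running on this node
ps aux | grep '[c]eph-osd'
# stop first, then start -- start alone was not enough
sudo /etc/init.d/ceph stop
sudo /etc/init.d/ceph start
# then re-check from a monitor node
sudo ceph health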

Version-Release number of selected component (if applicable):
1.2.3 for certain; I believe I have also seen this in 1.3.

How reproducible:
100% of the last few times that I tried this.

Steps to Reproduce:
1. Bring up a cluster using ceph-deploy.
2. On the monitor node, run sudo ceph health.
3. Note the HEALTH_ERR output.
4. On the OSD nodes, run /etc/init.d/ceph stop; /etc/init.d/ceph start.
5. Watch the health of the ceph cluster improve (see the commands below).
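
To watch the recovery from the monitor node, something like the following works (the 5-second watch interval is arbitrary):

watch -n 5 'sudo ceph health detail'
# or look at the OSD map directly
sudo ceph osd tree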

Actual results:
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds


Expected results:
HEALTH_OK

Additional info:
Executing the ceph stop and start on the OSDs is a workaround to this problem.

Comment 2 Ben England 2017-07-28 11:21:31 UTC
I am elevating this to high priority. It's still happening on RHCS 2.3 on John Harrigan's BAGL cluster, and I've seen it in the scale lab at times too. I should have filed a bz for it sooner. Is this part of QE regression testing?

This problem causes massive data movement at best and data unavailability at worst, and would disrupt any application running on the cluster.

My workaround is:

for d in /dev/sd[b-z] ; do ceph-disk activate ${d}1 ; done

What's weird is that some of the drives come up and some do not, almost like a race condition.  We'll try to get more data for you, including a sosreport and some system status.
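
Before running the loop, it may be worth confirming which data partitions are prepared but not yet active, and which OSDs the cluster sees as down (the grep patterns below are only a rough guess at the output format):

sudo ceph-disk list | grep 'ceph data'
sudo ceph osd tree | grep -i down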

John, is this reproducible? The cluster was installed with ceph-ansible.

Comment 4 Ben England 2017-07-28 17:28:50 UTC
This is the wrong bz to put this in, though it is remotely related (OSDs failing to start). John Harrigan is going to create a new bz for this problem. Lowering the priority of this bz. It wasn't clear to me whether there was a missing step in the installation procedure above that would have activated the OSDs, something like ceph-disk activate but driven through ceph-deploy.