Bug 1311273 - After install, OSDs do not come up. They need to be restarted.
Summary: After install, OSDs do not come up. They need to be restarted.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Disk
Version: 1.2.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 1.3.4
Assignee: Kefu Chai
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-23 19:18 UTC by Warren
Modified: 2018-02-20 20:48 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-20 20:48:42 UTC
Embargoed:



Description Warren 2016-02-23 19:18:03 UTC
Description of problem:
After bringing up a ceph cluster, ceph health shows:

HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds

On the OSD nodes, ps shows that no ceph-osd daemons are running.
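
For reference, a minimal way to confirm this state (a sketch; it assumes the default cluster name "ceph" and that the commands are run with sudo where needed):

ps -ef | grep [c]eph-osd      # on an OSD node: no ceph-osd processes are listed
sudo ceph health              # on the monitor node: reports the HEALTH_ERR above
sudo ceph osd tree            # on the monitor node: no OSDs appear ("no osds")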

In order to get out of this state, I needed to run the following on the OSD nodes:
/etc/init.d/ceph stop
/etc/init.d/ceph start

Note: Running only the start command appeared to succeed, but it did not clear this state.

Version-Release number of selected component (if applicable):
1.2.3 for certain; I believe I have also seen this in 1.3.

How reproducible:
100% of the last few times that I tried this.

Steps to Reproduce:
1. Bring up a cluster using ceph-deploy (a consolidated command sketch of these steps is given below).
2. On the monitor node, run: sudo ceph health
3. Note the HEALTH_ERR status.
4. On the OSD nodes, run: /etc/init.d/ceph stop; /etc/init.d/ceph start
5. Watch the health of the ceph cluster improve.
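
A consolidated sketch of the steps above, run from the admin node; the host names (mon1, osd1) and the device path (/dev/sdb) are placeholders, not names from this cluster:

ceph-deploy new mon1
ceph-deploy install mon1 osd1
ceph-deploy mon create-initial
ceph-deploy osd prepare osd1:/dev/sdb
ssh mon1 sudo ceph health                                            # steps 2-3: shows HEALTH_ERR
ssh osd1 'sudo /etc/init.d/ceph stop; sudo /etc/init.d/ceph start'   # step 4
ssh mon1 sudo ceph health                                            # step 5: improves toward HEALTH_OK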

Actual results:
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds


Expected results:
HEALTH_OK

Additional info:
Executing the ceph stop and start commands on the OSD nodes works around this problem.
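
To confirm that the workaround took effect, cluster health can be polled from the monitor node until it clears (a sketch; the 5-second interval is arbitrary):

until sudo ceph health | grep -q HEALTH_OK ; do
    sudo ceph health
    sleep 5
done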

Comment 2 Ben England 2017-07-28 11:21:31 UTC
I am elevating this to high priority. It is still happening on RHCS 2.3 on John Harrigan's BAGL cluster, and I have seen it in the scale lab at times too. I should have filed a bz for it sooner. Is this part of QE regression testing?

This problem causes massive data movement at best and data unavailability at worst, and would disrupt any application running on the cluster.

My workaround is:

for d in /dev/sd[b-z] ; do ceph-disk activate ${d}1 ; done

What's weird is that some of the drives come up and some do not, almost like a race condition.  We'll try to get more data for you, including a sosreport and some system status.
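
For comparison, ceph-disk also provides an activate-all subcommand that selects partitions by the Ceph OSD partition GUID instead of guessing device names (a sketch; it assumes the partitions were created by ceph-disk prepare with its default GPT partition types):

sudo ceph-disk activate-all    # activates every partition typed as a Ceph data partition
sudo ceph osd tree             # confirm that the previously missing OSDs are now up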

John, is this reproducible? The cluster was installed with ceph-ansible.

Comment 4 Ben England 2017-07-28 17:28:50 UTC
This is the wrong bz for that report, though it is remotely related (OSDs failing to start). John Harrigan is going to create a new bz for that problem, so I am lowering the priority of this one. It was not clear to me whether the installation procedure above was missing a step that would have activated the OSDs, something like ceph-disk activate but driven through ceph-deploy.

