Bug 1457231

Summary: osds are down after node restart
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Martin Kudlej <mkudlej>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED CANTFIX
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 2.3
CC: adeza, aschoen, ceph-eng-bugs, dzafman, gmeno, kchai, mkudlej, nthomas, sankarshan, seb, uboppana
Target Milestone: rc
Target Release: 2.5
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-08-02 15:31:42 UTC
Type: Bug

Description Martin Kudlej 2017-05-31 11:40:15 UTC
Description of problem:
I installed a test build of Ceph via ceph-ansible, and after a node restart I see that many OSDs are down. I do not see this behavior with the latest stable version. The OSDs were installed with collocated journals.
I see this message in the OSD log:
...with nothing to send, going to standby..

Version-Release number of selected component (if applicable):
ceph-osd-10.2.7-23.el7cp.x86_64.rpm 

How reproducible:
100%

Steps to Reproduce:
1. Install OSDs with collocated journals via ceph-ansible.
2. Restart the nodes in the cluster.
3. Check the OSD status, for example with "ceph osd tree".

Actual results:
After a node restart, many OSDs are down.

Expected results:
After a restart, all OSDs are up.
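
For reference, checking the OSD state after a reboot might look like the following (a sketch; exact output varies by version):

  # summary of how many OSDs are up/in
  ceph osd stat

  # per-OSD up/down state in the CRUSH tree
  ceph osd tree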

Comment 3 seb 2017-07-27 12:29:42 UTC
I need more info on this:

* Can you check whether the OSD systemd units are enabled?
* The title diverges from the description: are all of the OSDs down, or only some of them?

Thanks!
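
(For the first question, a quick check on one of the OSD nodes could look like this; the unit instance id "0" is a placeholder for a real OSD id on that node:)

  # is a specific OSD unit instance enabled?
  systemctl is-enabled ceph-osd@0.service

  # all OSD units currently loaded on the node, with their active state
  systemctl list-units 'ceph-osd@*'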

Comment 4 Martin Kudlej 2017-08-01 07:11:52 UTC
I do not have the cluster installed right now, so my answer to the first question is just a guess. I think that if Ceph is installed by ceph-ansible, all of the Ceph systemd units should be enabled. Also, because some OSDs were up after the restart, I think the units are enabled.

I am sure about the second answer. As I wrote in the description, many OSDs were down after the restart, and after the next restart a different set of OSDs was down.

Comment 5 seb 2017-08-01 09:17:43 UTC
ceph-disk is responsible for enabling the OSD unit files, so they 'should' be enabled.

OK, thanks. If you don't have the setup anymore, this is going to be difficult to debug... :(
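
If the setup does come back, output from something like the following on an affected node would help (just a suggestion for what to collect, not an official triage list):

  # show disks/partitions as ceph-disk sees them, including prepared/active OSD data and journal partitions
  ceph-disk list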

Comment 6 Alfredo Deza 2017-08-01 12:01:12 UTC
ceph-disk cannot guarantee that all OSDs will be up after a system reboot. This has nothing to do with the systemd units (in this case). Even if you manually enable all of the OSD units, they still might not come up correctly.

It is also hard to reproduce: you might reboot a node and have all of the OSDs come up.

See https://bugzilla.redhat.com/show_bug.cgi?id=1439210

From that ticket:

> We just have no idea what's going on or why at this point.

And:

> Right now I have no better theory than "udev events are not fired as they should". 

Basically: this is a known issue with ceph-disk, it is not usually related to enabling the OSD units, and there is no robust fix despite the numerous attempts at handling the udev/systemd/ceph-disk interaction when a system boots.

This is *not* an issue with ceph-ansible.
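
As a manual workaround after a reboot (not a fix, and whether it helps depends on why activation failed), OSDs that did not come up can often be brought back by re-running activation or starting the units by hand, e.g.:

  # re-fire udev "add" events for block devices so the ceph-disk activation rules run again
  udevadm trigger --action=add --subsystem-match=block

  # or activate all prepared OSD partitions directly
  ceph-disk activate-all

  # or start an individual OSD unit (the id "0" is a placeholder)
  systemctl start ceph-osd@0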

Comment 7 seb 2017-08-02 14:50:48 UTC
Should we close this, then?

Comment 8 Alfredo Deza 2017-08-02 15:31:42 UTC
Closing as 'Can't Fix'. It should really be a 'known issue' though.