Bug 1457231

Summary: osds are down after node restart
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Martin Kudlej <mkudlej>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED CANTFIX
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 2.3
CC: adeza, aschoen, ceph-eng-bugs, dzafman, gmeno, kchai, mkudlej, nthomas, sankarshan, seb, uboppana
Target Milestone: rc
Target Release: 2.5
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-08-02 15:31:42 UTC
Type: Bug

Description Martin Kudlej 2017-05-31 11:40:15 UTC
Description of problem:
I installed a test build of Ceph via ceph-ansible, and after a node restart I see that many OSDs are down. I do not see this behavior with the latest stable version. The OSDs were installed with collocated journals.
I see this message in the OSD log:
...with nothing to send, going to standby..

Version-Release number of selected component (if applicable):
ceph-osd-10.2.7-23.el7cp.x86_64.rpm 

How reproducible:
100%

Steps to Reproduce:
1. Install OSDs with collocated journals via ceph-ansible.
2. Restart the nodes in the cluster.
3. Check the OSD status, for example with "ceph osd tree".

Actual results:
After a node restart, many OSDs are down.

Expected results:
After a restart, all OSDs are up.
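
For reference, checking the OSD state after a reboot might look like the following (a sketch; exact output varies by version):

  # summary of how many OSDs are up/in
  ceph osd stat

  # per-OSD up/down state in the CRUSH tree
  ceph osd tree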

Comment 3 seb 2017-07-27 12:29:42 UTC
I need more info on this:

* Can you check whether the OSD systemd units are enabled?
* The title diverges from the description: are all of the OSDs down, or only some of them?

Thanks!
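
(For the first question, a quick check on one of the OSD nodes could look like this; the unit instance id "0" is a placeholder for a real OSD id on that node:)

  # is a specific OSD unit instance enabled?
  systemctl is-enabled ceph-osd@0.service

  # all OSD units currently loaded on the node, with their active state
  systemctl list-units 'ceph-osd@*'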

Comment 4 Martin Kudlej 2017-08-01 07:11:52 UTC
I do not have the cluster installed right now, so my answer to the first question is just a guess. I think that if Ceph is installed by ceph-ansible, all of the Ceph systemd units should be enabled. Also, because some OSDs were up after the restart, I think the units are enabled.

I am sure about the second answer. As I wrote in the description, many OSDs were down after the restart, and after the next restart a different set of OSDs was down.

Comment 5 seb 2017-08-01 09:17:43 UTC
ceph-disk is responsible for enabling the OSD unit files, so they 'should' be enabled.

OK, thanks. If you don't have the setup anymore, this is going to be difficult to debug... :(
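
If the setup does come back, output from something like the following on an affected node would help (just a suggestion for what to collect, not an official triage list):

  # show disks/partitions as ceph-disk sees them, including prepared/active OSD data and journal partitions
  ceph-disk list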

Comment 6 Alfredo Deza 2017-08-01 12:01:12 UTC
ceph-disk cannot guarantee that all OSDs will be up after a system reboot. This has nothing to do with the systemd units (in this case). Even if you manually enable all of the OSD units, they still might not come up correctly.

It is also hard to reproduce: you might reboot a node and have all of the OSDs come up.

See https://bugzilla.redhat.com/show_bug.cgi?id=1439210

From that ticket:

> We just have no idea what's going on or why at this point.

And:

> Right now I have no better theory than "udev events are not fired as they should". 

Basically: this is a known issue with ceph-disk, it is not usually related to enabling the OSD units, and there is no robust fix despite the numerous attempts at handling the udev/systemd/ceph-disk interaction when a system boots.

This is *not* an issue with ceph-ansible.
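
As a manual workaround after a reboot (not a fix, and whether it helps depends on why activation failed), OSDs that did not come up can often be brought back by re-running activation or starting the units by hand, e.g.:

  # re-fire udev "add" events for block devices so the ceph-disk activation rules run again
  udevadm trigger --action=add --subsystem-match=block

  # or activate all prepared OSD partitions directly
  ceph-disk activate-all

  # or start an individual OSD unit (the id "0" is a placeholder)
  systemctl start ceph-osd@0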

Comment 7 seb 2017-08-02 14:50:48 UTC
Should we close this, then?

Comment 8 Alfredo Deza 2017-08-02 15:31:42 UTC
Closing as 'Can't Fix'. It should really be a 'known issue' though.