Bug 1457231 - OSDs are down after node restart
Summary: OSDs are down after node restart
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2.5
Assignee: Sébastien Han
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-31 11:40 UTC by Martin Kudlej
Modified: 2022-02-21 18:07 UTC (History)
11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-02 15:31:42 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1439210 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 1439210

Description Martin Kudlej 2017-05-31 11:40:15 UTC
Description of problem:
I've installed a test build of Ceph via ceph-ansible, and after a node restart I see that many OSDs are down. I don't see this behavior with the latest stable version. The OSDs were installed with collocated journals.
I see this message in the OSD log:
...with nothing to send, going to standby..

Version-Release number of selected component (if applicable):
ceph-osd-10.2.7-23.el7cp.x86_64.rpm 

How reproducible:
100%

Steps to Reproduce:
1. Install OSDs with collocated journals via ceph-ansible.
2. Restart the nodes in the cluster.
3. Check the OSD status, for example with "ceph osd tree" (see the example commands below).
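
Any of the standard status commands will show which OSDs are marked down after the reboot (output omitted; the commands below are generic, not captured from this particular cluster):

  # short summary: "X osds: Y up, Z in"
  ceph osd stat
  # per-host tree, with down OSDs marked "down"
  ceph osd tree
  # overall cluster health
  ceph -s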

Actual results:
After a node restart, many OSDs are down.

Expected results:
After a restart, all OSDs are up.

Comment 3 seb 2017-07-27 12:29:42 UTC
I need more info on this:

* Can you check whether the systemd units are enabled? (See the example commands below.)
* The title diverges from the description: are all of the OSDs down, or only some of them?
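
To answer the first question, something like the following can be run on an OSD node (the OSD ID 0 below is only an example, not taken from this cluster):

  # check whether a specific OSD instance unit is enabled
  systemctl is-enabled ceph-osd@0
  # list all OSD instance units on the node, including inactive/failed ones
  systemctl list-units 'ceph-osd@*' --all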

Thanks!

Comment 4 Martin Kudlej 2017-08-01 07:11:52 UTC
I don't have the cluster installed right now, so the answer to the first question is just a guess. I think that if Ceph is installed by ceph-ansible, all of the Ceph systemd units should be enabled. Also, because some OSDs were up after the restart, I think the units are enabled.

I'm sure about the second answer. As I wrote in the description, many OSDs were down after the restart, and after the next restart a different set of OSDs was down.

Comment 5 seb 2017-08-01 09:17:43 UTC
ceph-disk is responsible for enabling the OSD unit files, so they 'should' be enabled.

OK, thanks. If you don't have the setup anymore, that's going to be difficult to debug... :(

Comment 6 Alfredo Deza 2017-08-01 12:01:12 UTC
ceph-disk cannot guarantee that all OSDs will be up after a system reboot. This has nothing to do with the systemd units (in this case). Even if you manually enable all of the OSD units, it still might not work correctly.

It is hard to reproduce as well: you might reboot a node and have all OSDs come up.

See https://bugzilla.redhat.com/show_bug.cgi?id=1439210

From that ticket:

> We just have no idea what's going on or why at this point.

And:

> Right now I have no better theory than "udev events are not fired as they should". 

Basically: it is a known issue with ceph-disk, it is generally not related to whether the OSD units are enabled, and there is no robust fix despite the numerous attempts at handling the udev/systemd/ceph-disk interaction when a system boots.

This is *not* an issue with ceph-ansible.
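
As a manual workaround (not a fix), OSDs that stayed down after a reboot can usually be brought up by re-triggering activation by hand. The commands below are the generic approach for a ceph-disk based deployment, not something verified on this particular cluster (the OSD ID 3 is only an example):

  # re-run activation for all ceph-disk prepared partitions on the node
  ceph-disk activate-all
  # or start an affected OSD unit directly
  systemctl start ceph-osd@3
  # if udev events were missed at boot, re-triggering them can also kick off activation
  udevadm trigger --action=add --subsystem-match=block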

Comment 7 seb 2017-08-02 14:50:48 UTC
Should we close this, then?

Comment 8 Alfredo Deza 2017-08-02 15:31:42 UTC
Closing as 'Can't Fix'. It should really be documented as a known issue, though.

