Description of problem:
I've installed a test Ceph version via ceph-ansible, and after a node restart I see that many OSDs are down. I don't see this behavior with the latest stable version. The OSDs were installed with collocated journals. I see this message in the OSD log: "...with nothing to send, going to standby".

Version-Release number of selected component (if applicable):
ceph-osd-10.2.7-23.el7cp.x86_64.rpm

How reproducible:
100%

Steps to Reproduce:
1. Install OSDs with collocated journals via ceph-ansible.
2. Restart the nodes in the cluster.
3. Check OSD status, for example with "ceph osd tree".

Actual results:
After a node restart, many OSDs are down.

Expected results:
After a restart, all OSDs are up.
I need more info on this:
* can you check if the systemd unit is enabled?
* the title diverges from the description: are all the OSDs down, or only some of them?
Thanks!
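A quick way to answer the first question would be to query systemd directly on an affected node. This is a minimal sketch, assuming an OSD id of 0 (substitute the real ids from your node); it falls back to "unknown" rather than failing on a box without systemctl:

```shell
# Query whether the per-OSD systemd unit is enabled.
# osd_id=0 is an example; loop over your real OSD ids.
osd_id=0
state=$(systemctl is-enabled "ceph-osd@${osd_id}.service" 2>/dev/null || echo "unknown")
echo "ceph-osd@${osd_id}: ${state}"
```

`is-enabled` prints `enabled`, `disabled`, etc., and exits non-zero when the unit is not enabled, which is why the fallback is attached to the command substitution.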
I don't have the cluster installed right now, so my answer to the first question is just a guess. I think that if Ceph is installed by ceph-ansible, all of the Ceph systemd units should be enabled. Also, because some OSDs were up after the restart, I think they are enabled. I'm sure about the second answer: as I wrote in the description, many OSDs were down after a restart, and after the next restart a different set of OSDs was down.
ceph-disk is responsible for enabling the OSD unit files, so they 'should' be enabled. OK, thanks. If you don't have the setup anymore, that's going to be difficult to debug... :(
ceph-disk cannot guarantee that all OSDs will be up after a system reboot. This has nothing to do with systemd units (in this case); even if you manually enable all the OSD units, it still might not work correctly. It is hard to reproduce as well: you might reboot a node and have all the OSDs come up.

See https://bugzilla.redhat.com/show_bug.cgi?id=1439210. From that ticket:

> We just have no idea what's going on or why at this point.

And:

> Right now I have no better theory than "udev events are not fired as they should".

Basically: this is a known issue with ceph-disk, it is not generally related to enabling the OSD units, and there is no robust fix despite the numerous attempts at handling the udev/systemd/ceph-disk interaction at boot time. This is *not* an issue with ceph-ansible.
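For anyone who lands here with OSDs stuck down after a boot, a hedged workaround sketch (run as root on the affected OSD node; the commands are guarded so the script degrades gracefully where the tools are missing or unprivileged):

```shell
# If udev "add" events were missed at boot, re-activating prepared OSD
# devices by hand sometimes brings the missing OSDs up.
if command -v ceph-disk >/dev/null 2>&1; then
  ceph-disk activate-all || echo "activate-all failed; check the OSD logs"
else
  echo "ceph-disk not on PATH; run this on an OSD node"
fi

# Alternatively, re-fire udev add events for block devices (needs root):
if command -v udevadm >/dev/null 2>&1; then
  udevadm trigger --subsystem-match=block --action=add || echo "udevadm trigger failed (not root?)"
else
  echo "udevadm not on PATH"
fi
```

This is only a manual recovery step, not a fix: as noted above, the underlying boot-time race has no robust solution in ceph-disk.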
should we close this then?
Closing as 'Can't Fix'. It should really be a 'known issue' though.