Description of problem: This was discovered by John, who is using the scale lab to perform DFG workload testing. During the tests he used the lvm osd scenario, and after a couple of days of runs, due to setup issues, he tried to purge the cluster with ceph-ansible and redeploy it, but the OSDs failed to come up. Note that the LVM layout was set up manually, since ceph-ansible does not do that for the lvm scenario. We might be missing a couple of recent fixes in 3.1, and it should also be easy to recreate, so please look into this one.
Notes from the thread relevant to the ceph-volume issue.

The ansible runs (purge and deploy) report back no errors, however I have no cluster. Looking at the OSD services after a purge, I see they report failed, like this:

  ● ceph-osd
     Loaded: not-found (Reason: No such file or directory)
     Active: failed (Result: start-limit) since Wed 2018-06-20 01:59:15 UTC; 1 day 5h ago
   Main PID: 633019 (code=exited, status=1/FAILURE)

And after the deploy it reports this:

  ● ceph-osd - Ceph object storage daemon osd.153
     Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
     Active: failed (Result: start-limit) since Wed 2018-06-20 01:59:15 UTC; 1 day 6h ago
   Main PID: 633019 (code=exited, status=1/FAILURE)

The timestamp of "1 day 6h ago" on this failure is very odd, since I performed the purge and redeploy tonight. It's almost like this message is leftover from when the cluster lost
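For anyone gathering the same data, a minimal sketch of the commands I would run on an affected OSD host to capture the unit state and recent journal output (osd.153 is just the example id from the status above; substitute the real one):

  # list all failed OSD units on this host
  systemctl list-units 'ceph-osd@*' --all --state=failed

  # status and recent journal entries for one unit
  systemctl status ceph-osd@153
  journalctl -u ceph-osd@153 --no-pager --since "2 hours ago"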
Created attachment 1453568 [details] purge-cluster ansible output
Created attachment 1453569 [details] provision that did not give cluster
Created attachment 1453570 [details] provision that worked, run after manual teardown and recreate LVM
It is worth noting that the purge and deploy ansible runs do not report failed tasks.
Can we get information about what ceph-ansible version was being used?
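For reference, one way to collect that from the admin node, assuming ceph-ansible was installed from RPM as in the RHCS builds:

  rpm -q ceph-ansible
  # shows the installed version and the repo it came from
  yum info ceph-ansible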
(In reply to John Harrigan from comment #5)
> Created attachment 1453569 [details]
> provision that did not give cluster

In this attachment I see the cluster failed to restart the RGW daemons in this task: "[ceph-defaults : restart ceph rgw daemon(s) - non container]". That is why the play failed. Do you have log output where the OSDs fail to start in the ceph-ansible output?
(In reply to Alfredo Deza from comment #8)
> Can we get information about what ceph-ansible version was being used?

ceph-ansible.noarch 3.1.0-0.1.rc5.el7cp, from this build: RHCEPH-3.1-RHEL-7-20180530.ci.0
(In reply to Andrew Schoen from comment #9)
> In this attachment I see the cluster failed to restart the RGW daemons in
> this task "[ceph-defaults : restart ceph rgw daemon(s) - non container]".
> Which is why it failed. Do you have log output where the OSDs fail to start
> in the ceph-ansible output?

Sorry, no. Those logs are gone. This cluster is being stressed for scale and perf testing on RHCS 3.1, and it has since been purged.

Here are the specific steps I went through:
* created partitions and LVs using a custom ansible playbook (BIprovision); the rough shape of that step is sketched below
* deployed the cluster using ceph-ansible and osd_scenario=lvm
* broke the cluster :(
* purged the cluster using the purge-cluster playbook (see the purge attachment)
* deployed the cluster using ceph-ansible and osd_scenario=lvm (see the "deploy no go" attachment)
* purged the cluster using the purge-cluster playbook
* ran the BIprovision ansible playbook again
* deployed the cluster using ceph-ansible and osd_scenario=lvm (success)

The ansible playbook for the partition and lvm work is at https://github.com/jharriga/BIprovision, specifically this one:
https://github.com/jharriga/BIprovision/blob/master/FS_2nvme_noCache.yml

Perhaps someone can recreate this on a smaller cluster?
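This is not the actual content of FS_2nvme_noCache.yml, but a rough sketch (with placeholder device and VG/LV names) of the kind of manual LVM preparation that playbook performs before ceph-ansible's lvm scenario can consume the volumes:

  # placeholder device and names; the real layout is defined in FS_2nvme_noCache.yml
  pvcreate /dev/nvme0n1
  vgcreate ceph-vg-nvme0 /dev/nvme0n1
  lvcreate -n osd-data-0 -l 100%FREE ceph-vg-nvme0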
Vasu (or John), I am going to have to close this, as we can't really replicate the problem. We test purge and recreate on every single scenario we have, and we verify that the OSDs do come up.

This ticket is lacking information which is crucial to narrow down what the problem is (aside from the fact that we can't replicate it):

* there are no /var/log/ceph/ceph-volume* logs
* the attached ansible log is not relevant to the issue reported (rgw is failing, so the play is failing)

In the end, if I am understanding the last comment correctly, the final deployment with ceph-ansible and osd_scenario=lvm was successful, which sounds correct to me. If somewhere in the middle the purge/redeploy didn't work, we will need better/more details.
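If this is hit again, a quick sketch of what to collect from an affected OSD host before purging, so we can see what ceph-volume actually did (log paths assume the default /var/log/ceph location):

  # ceph-volume keeps its own logs, separate from the OSD daemon logs
  ls -l /var/log/ceph/ceph-volume*.log

  # list the LVs ceph-volume knows about and the OSD metadata stored in their LV tags
  ceph-volume lvm list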
Feel free to re-open if you can replicate this with information relevant to the OSDs not coming up. Make sure to include the log files from /var/log/ceph/ceph-volume*, and if possible, please use the following env var when running: ANSIBLE_STDOUT_CALLBACK=debug
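For example, the deploy could be re-run with that callback enabled like this (the inventory path is a placeholder and the playbook name depends on your setup; site.yml is the usual non-containerized entry point):

  ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -vv -i /path/to/inventory site.yml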