Bug 1593834 - ceph-ansible purge/recreate fails with lvm scenario, OSDs don't come up
Summary: ceph-ansible purge/recreate fails with lvm scenario, OSDs don't come up
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Volume
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.1
Assignee: Alfredo Deza
QA Contact: Parikshith
URL:
Whiteboard:
Depends On:
Blocks: 1581350
 
Reported: 2018-06-21 16:06 UTC by Vasu Kulkarni
Modified: 2018-06-27 16:36 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 16:36:32 UTC
Embargoed:


Attachments
purge-cluster ansible output (140.28 KB, text/plain)
2018-06-21 17:44 UTC, John Harrigan
provision that did not give cluster (1.01 MB, text/plain)
2018-06-21 17:45 UTC, John Harrigan
provision that worked, run after manual teardown and recreate LVM (1.26 MB, text/plain)
2018-06-21 17:46 UTC, John Harrigan

Description Vasu Kulkarni 2018-06-21 16:06:09 UTC
Description of problem:

This was discovered by John, who is using the scale lab to perform DFG workload testing. During the tests he used the lvm scenario, and after a couple of days of running, due to setup issues, he tried to purge the cluster with ceph-ansible and recreate it, but the OSDs failed to come up. The point to note here is that LVM was set up manually, as ceph-ansible doesn't do that.

We might be missing a couple of recent fixes in 3.1, and it should also be easy to reproduce, so please look into this one.
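
Since the LVs were created outside ceph-ansible, one way to see what state a purge leaves them in is to check the tags ceph-volume writes on each logical volume. A minimal sketch (run on an OSD host; nothing here is taken from the attachments):

  # Show every LV together with the tags ceph-volume uses to record OSD metadata
  # (after a successful prepare these include ceph.osd_id, ceph.osd_fsid, etc.)
  lvs -o lv_name,vg_name,lv_tags --noheadings

  # Ask ceph-volume which LVM-backed OSDs it currently knows about
  ceph-volume lvm list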

Comment 3 Vasu Kulkarni 2018-06-21 16:09:19 UTC
Notes from the thread that are relevant to the ceph-volume issue.

The ansible runs (purge and deploy) report no errors; however, I have no cluster.
Looking at the OSD services after a purge, I see they report failed, like this:
● ceph-osd
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: start-limit) since Wed 2018-06-20 01:59:15 UTC; 1 day 5h ago
 Main PID: 633019 (code=exited, status=1/FAILURE)

And after the deploy it reports this:

● ceph-osd - Ceph object storage daemon osd.153
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: failed (Result: start-limit) since Wed 2018-06-20 01:59:15 UTC; 1 day 6h ago
 Main PID: 633019 (code=exited, status=1/FAILURE)

The timestamp of "1 day 6h ago" on this failure is very odd, since I performed the purge and redeploy tonight. It's almost as if this message is left over from when the cluster was lost.
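
The stale "failed" state with an old timestamp suggests leftover systemd unit state from before the purge. A minimal sketch of how one might clear it and capture the current failure (osd.153 is taken from the status output above; run on the affected OSD host):

  # Drop the cached failure state and re-read any unit files changed by the purge
  systemctl reset-failed 'ceph-osd@*'
  systemctl daemon-reload

  # Show the most recent log lines for the failing OSD, then retry it
  journalctl -u ceph-osd@153 --no-pager -n 50
  systemctl start ceph-osd@153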

Comment 4 John Harrigan 2018-06-21 17:44:22 UTC
Created attachment 1453568 [details]
purge-cluster ansible output

Comment 5 John Harrigan 2018-06-21 17:45:10 UTC
Created attachment 1453569 [details]
provision that did not give cluster

Comment 6 John Harrigan 2018-06-21 17:46:06 UTC
Created attachment 1453570 [details]
provision that worked, run after manual teardown and recreate LVM

Comment 7 John Harrigan 2018-06-21 17:48:14 UTC
It is worth noting that the purge and deploy ansible runs do not report failed tasks.

Comment 8 Alfredo Deza 2018-06-22 14:40:16 UTC
Can we get information about what ceph-ansible version was being used?

Comment 9 Andrew Schoen 2018-06-22 14:52:34 UTC
(In reply to John Harrigan from comment #5)
> Created attachment 1453569 [details]
> provision that did not give cluster

In this attachment I see the cluster failed to restart the RGW daemons in this task "[ceph-defaults : restart ceph rgw daemon(s) - non container]". Which is why it failed. Do you have log output where the OSDs fail to start in the ceph-ansible output?

Comment 10 John Harrigan 2018-06-22 15:48:09 UTC
(In reply to Alfredo Deza from comment #8)
> Can we get information about what ceph-ansible version was being used?

ceph-ansible.noarch                  3.1.0-0.1.rc5.el7cp

from this build RHCEPH-3.1-RHEL-7-20180530.ci.0

Comment 11 John Harrigan 2018-06-22 15:59:32 UTC
(In reply to Andrew Schoen from comment #9)
> (In reply to John Harrigan from comment #5)
> > Created attachment 1453569 [details]
> > provision that did not give cluster
> 
> In this attachment I see the cluster failed to restart the RGW daemons in
> this task "[ceph-defaults : restart ceph rgw daemon(s) - non container]".
> Which is why it failed. Do you have log output where the OSDs fail to start
> in the ceph-ansible output?

Sorry, no. Those logs are gone.
This cluster is being stressed for scale and perf testing on RHCS 3.1, and it has since been purged.

Here are the specific steps I went through:
* created partitions and LVs using custom ansible playbook (BIprovision)
* deployed cluster using ceph-ansible and osd_scenario=lvm
* broke cluster :(
* purged cluster using purge-cluster playbook (see purge attachment)
* deployed cluster using ceph-ansible and osd_scenario=lvm (see deploy no go attachment)
* purged-cluster using purge-cluster playbook
* ran BIprovision ansible playbook
* deployed cluster using ceph-ansible and osd_scenario=lvm (success)

The ansible playbook for partition and lvm work is at 
  https://github.com/jharriga/BIprovision
specifically this one - 
  https://github.com/jharriga/BIprovision/blob/master/FS_2nvme_noCache.yml
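
For reference, the manual teardown and recreate of LVM that preceded the successful deploy boils down to removing and rebuilding the volume group on each OSD device. A minimal sketch, with hypothetical VG/LV names and device path rather than the ones used in the playbook above:

  # Tear down the old LVM layout on one OSD device (names are examples only)
  lvremove -y ceph-vg/data-lv
  vgremove -y ceph-vg
  pvremove -y /dev/nvme0n1p1

  # Recreate it before re-running ceph-ansible with osd_scenario=lvm
  pvcreate /dev/nvme0n1p1
  vgcreate ceph-vg /dev/nvme0n1p1
  lvcreate -l 100%FREE -n data-lv ceph-vg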


Perhaps someone can reproduce this on a smaller cluster?

Comment 12 Alfredo Deza 2018-06-27 16:31:48 UTC
Vasu (or John), I am going to have to close this as we can't really replicate the problem.

We do test purge and recreate on every single scenario we have, and we test that OSDs do come up.

This ticket is lacking information that is crucial to narrowing down what the problem is (aside from the fact that we can't replicate it):


* there are no /var/log/ceph/ceph-volume* logs
* the attached ansible log is not relevant to the issue reported (rgw is failing, so the play is failing)

In the end, if I am understanding the last comment correctly: a final deployment with ceph-ansible and osd_scenario=lvm is successful, which sounds correct to me.

If somewhere in the middle the purge/redeploy didn't work, we will need better/more details.

Comment 13 Alfredo Deza 2018-06-27 16:36:32 UTC
Feel free to re-open if you can replicate with information relevant to the OSDs not coming up. Make sure to include log files from /var/log/ceph/ceph-volume*, and if possible, please use the following env var when running:

  ANSIBLE_STDOUT_CALLBACK=debug
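
For example, a re-run with full task output plus log collection might look like the sketch below (the inventory path and playbook locations assume a default ceph-ansible RPM install and are not taken from this report):

  # Re-run purge and deploy with verbose task output
  cd /usr/share/ceph-ansible
  ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -i hosts infrastructure-playbooks/purge-cluster.yml
  ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -i hosts site.yml

  # Collect the ceph-volume logs from every OSD host
  ansible osds -i hosts -m fetch -a 'src=/var/log/ceph/ceph-volume.log dest=./ceph-volume-logs/'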

