Bug 1252158

Summary: overcloud deploy with ceph reports success but ceph is not usable because OSD/journals not created and no ceph.conf
Product: Red Hat OpenStack
Reporter: jliberma <jliberma>
Component: rhosp-director
Assignee: John Fulton <johfulto>
Status: CLOSED DUPLICATE
QA Contact: Yogev Rabl <yrabl>
Severity: unspecified
Docs Contact:
Priority: urgent
Version: 7.0 (Kilo)
CC: hbrock, jdonohue, jean-francois.bibeau, jefbrown, johfulto, jomurphy, jraju, mburns, mcornea, morazi, rhel-osp-director-maint, skinjo
Target Milestone: beta
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-17 16:32:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1399824

Description jliberma@redhat.com 2015-08-10 20:32:25 UTC
Description of problem:

heat stack-create with Ceph and the storage templates reports success even when no ceph.conf is written and no OSDs are created on the Ceph nodes.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. deploy undercloud

2. Modify the hiera defaults in ceph.yaml with incorrect syntax (a valid form is shown after these steps for contrast), e.g.:
ceph::profile::params::osds:
    '/dev/sdb':
        journal:
    '/dev/sdc':
        journal:
3. Deploy overcloud + ceph with templates:

openstack overcloud deploy -e templates/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates /home/stack/templates/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml  

4. SSH to a controller node (Ceph monitor) and run "ceph -s" or "ceph health"; it will report no OSDs:

HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds

5. No ceph.conf is created on the Ceph servers, no log files are created, and no OSDs are created, but the ceph service starts successfully.
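
For contrast with the intentionally broken hiera in step 2, a valid ceph::profile::params::osds mapping assigns each OSD device a hash that may name a journal device. A minimal sketch, with placeholder device paths not taken from this environment:

    ceph::profile::params::osds:
      '/dev/sdb':
        journal: '/dev/sdd'
      '/dev/sdc':
        journal: '/dev/sdd'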

Actual results:

The overcloud reports as successfully deployed, but Ceph is non-functional.

systemctl status ceph.service reports a successful service start, but there is no ceph.conf and there are no OSDs on the Ceph servers.

Expected results:

heat stack-create should report a failure. Puppet should test for the existence of ceph.conf and for the service start status.
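
A rough sketch of the kind of check requested above (not an existing validation; the path and unit name assume a default puppet-ceph layout and may differ per environment), run on each Ceph node:

    # hypothetical post-deploy sanity check
    test -s /etc/ceph/ceph.conf || { echo "ERROR: /etc/ceph/ceph.conf missing or empty"; exit 1; }
    systemctl is-active --quiet ceph.service || { echo "ERROR: ceph.service is not active"; exit 1; }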

Additional info:

Many errors can trigger this problem:
1) incorrect syntax in ceph.yaml
2) an existing fsid on the OSD disks specified in ceph.yaml, which creates a mismatch between the expected and existing fsid (for example, after a reinstall)

Ideally, a stack delete would wipe the partitions on the OSD disks, including the MBR.
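
A manual workaround along those lines (not part of the director workflow; the device names are placeholders and the commands destroy all data on the listed disks):

    # DESTRUCTIVE: clears partition tables and signatures on the listed disks
    for dev in /dev/sdb /dev/sdc; do
        sgdisk --zap-all "$dev"   # wipe GPT and protective MBR
        wipefs --all "$dev"       # wipe remaining filesystem/RAID signatures
    done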

Comment 5 Dan Prince 2015-12-05 20:43:33 UTC
Couple of thoughts here:

Is there a good way to validate ceph.conf once it is created? If so, perhaps this is something we might add to puppet-ceph to make it more robust there.
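
One possible lightweight check (a sketch only, not something this report confirms puppet-ceph does) is to look up a key that must be present, since ceph-conf exits non-zero when the key cannot be found:

    # expected to exit non-zero if /etc/ceph/ceph.conf is absent or defines no fsid
    ceph-conf -c /etc/ceph/ceph.conf --lookup fsid >/dev/null || echo "ceph.conf missing or incomplete"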

With regard to the existing fsid on OSD disks: should we be wiping disks clean on provisioning? Or perhaps we wipe disks clean when they get deleted. Ironic does have a clean_nodes setting (which we set to false in Instack), but we could instruct users of Ceph clusters to enable it if this is a concern.
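
For reference, a sketch of enabling that setting in undercloud.conf (false by default; the undercloud install step would need to be re-run afterwards):

    # undercloud.conf: have Ironic erase disks during node cleaning
    [DEFAULT]
    clean_nodes = true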

Comment 9 Mike Burns 2016-04-07 20:47:27 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 12 seb 2016-08-23 09:27:33 UTC
I think we should fail early by checking the syntax of ceph.yaml. However, this could mean applying this kind of check to all the other templates.
Perhaps the easiest thing to do is to validate the state of the cluster once the deployment is done?
Basically, if Ceph health reports HEALTH_ERR, we fail the stack and start investigating.
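
A minimal sketch of such a post-deployment check (not an existing validation step), run from a monitor node:

    # fail the validation when the cluster reports HEALTH_ERR, as suggested above
    if ceph health | grep -q HEALTH_ERR; then
        echo "Ceph reports HEALTH_ERR; failing the deployment"
        exit 1
    fi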

Comment 13 John Fulton 2016-10-12 15:47:37 UTC
*** Bug 1312192 has been marked as a duplicate of this bug. ***

Comment 14 John Fulton 2016-10-12 15:53:48 UTC
- This situation has been helped tremendously by RH 1370439, fixed in OSP10
- This bug should be closed after implementing the description in comment #12

Comment 16 John Fulton 2017-01-17 16:32:40 UTC
This issue is a symptom of what happens when you don't clean your disks during deployment or redeployment. That symptom is now captured, and the deploy fails as requested here as a result of the fix for RH 1370439. After that failure happens, the fix is to enable a new flag to zap the disks, as described in RH 1377867. Now that our fix for 1377867 is in the works upstream, to zap the old disks as per Dan's suggestion in comment #5, and following a discussion in DFG:Ceph (including Seb, who added comment #12), our conclusion is to mark this as a duplicate of 1377867. 1377867 is on schedule to be fixed in OSP 11.

*** This bug has been marked as a duplicate of bug 1377867 ***