Description of problem: heat stack-create with ceph + templates reports success even if the ceph OSDs do not create a ceph.conf.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy the undercloud.
2. Modify the hiera defaults in ceph.yaml with incorrect syntax, e.g.:

     ceph::profile::params::osds:
       '/dev/sdb':
         journal:
       '/dev/sdc':
         journal:

3. Deploy the overcloud + ceph with the templates:

     openstack overcloud deploy -e templates/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates /home/stack/templates/openstack-tripleo-heat-templates/ -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml

4. ssh to a ceph controller and run "ceph -s" or "ceph health": it reports no OSDs:

     HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
     HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
     HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds
     HEALTH_ERR 256 pgs stuck inactive; 256 pgs stuck unclean; no osds

5. No ceph.conf is created on the ceph servers, no log files are created, and no OSDs are created, yet the service starts successfully.

Actual results:
The overcloud reports successfully deployed, but ceph is non-functional. "systemctl status ceph.service" reports a successful service start, but there is no ceph.conf and no OSDs on the ceph servers.

Expected results:
heat stack-create reports failed. Puppet should test for the existence of ceph.conf and for the service start status.

Additional info:
Many errors can trigger this problem:
1) incorrect syntax in ceph.yaml
2) an existing fsid on the OSD disks specified in ceph.yaml, which creates a mismatch between the expected and existing fsid (e.g. after a reinstall)

Ideally a stack delete would wipe the partitions on the OSDs, including the MBR.
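For reference, the checks behind steps 4-5 boil down to the following commands (a minimal recap of the observations above, not additional diagnostics):

    # On a controller (monitor) node: the cluster responds but has no OSDs
    sudo ceph health                      # -> HEALTH_ERR ... no osds

    # On a ceph storage node: the service "started" but nothing was configured
    ls -l /etc/ceph/ceph.conf             # file is missing
    sudo systemctl status ceph.service    # still reports a successful start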
Couple of thoughts here: Is there a good way to validate ceph.conf once it is created? If so, perhaps this is something we might add to puppet-ceph to make it more robust there. With regard to the existing fsid on the OSD disks, should we be wiping disks clean on provisioning? Or perhaps we wipe disks clean when they get deleted. Ironic does have a clean_nodes setting (which we set to false in Instack), but we could instruct users of Ceph clusters to enable it if this is a concern.
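As a sketch only (assuming crudini is available and undercloud.conf sits in the stack user's home directory), opting in to node cleaning would look something like:

    # enable Ironic node cleaning on the undercloud (Instack defaults this to false)
    crudini --set ~/undercloud.conf DEFAULT clean_nodes true
    # re-run the undercloud installer to apply the changed setting
    openstack undercloud install

Cleaning does add time to each deploy, so this would remain an opt-in that we document for Ceph users, as suggested above.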
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
I think we should fail early by checking the syntax of ceph.yaml. However, this could mean applying this kind of check to all the other templates. Perhaps the easiest thing to do is to validate the state of the cluster once the deployment is done? Basically, if Ceph health reports HEALTH_ERR, we fail the stack and start investigating.
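A rough sketch of that post-deployment check (assuming it runs somewhere with access to the Ceph admin keyring; how it would be hooked into the Heat stack is left open):

    #!/bin/bash
    # fail the validation if the cluster reports HEALTH_ERR
    health=$(sudo ceph health)
    case "$health" in
      HEALTH_ERR*)
        echo "Ceph reports: ${health} -- failing the deployment" >&2
        exit 1
        ;;
      *)
        echo "Ceph health: ${health}"
        ;;
    esac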
*** Bug 1312192 has been marked as a duplicate of this bug. ***
- This situation has been helped tremendously by RH 1370439, fixed in OSP10.
- This bug should be closed after implementing the description in comment #12.
This issue is a symptom of what happens when you don't clean your disks during deployment or redeployment. That symptom is now captured, and the deploy fails as requested here, as a result of the outcome of RH 1370439. After that failure happens, the fix is to enable a new flag to zap the disks, as described in RH 1377867. Now that our fix for 1377867 is in the works upstream, to zap the old disks as per Dan in comment #5, and following a discussion in DFG:Ceph (including Seb, who added comment #12), our conclusion is to mark this as a duplicate of 1377867. 1377867 is on schedule to be fixed in OSP11.

*** This bug has been marked as a duplicate of bug 1377867 ***
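For anyone hitting the fsid mismatch before the 1377867 fix lands, one manual workaround is to wipe the previously used OSD disks by hand before redeploying, along these lines (illustrative only; /dev/sdb is an example device and these commands destroy its contents):

    # run on each ceph storage node, for each previously used OSD disk
    sudo sgdisk --zap-all /dev/sdb   # clear the GPT/MBR partition tables
    sudo wipefs -a /dev/sdb          # clear any remaining filesystem signatures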