Created attachment 1237770 [details]
all of the logs, templates and deployment command
Description of problem:
A deployment of 3 controller nodes, 3 compute nodes, and 15 Ceph storage nodes. Each Ceph storage node should run 24 OSDs, for 360 OSDs overall. The deployment ended in failure, with only 355 OSDs deployed.
The failing commands from the logs are:
chown -h ceph:ceph /dev/sda
ceph-disk prepare --cluster-uuid 7c12ae5a-c871-11e6-9b00-b8ca3a66e37c /dev/sda /dev/nvme0n1
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy the overcloud on an environment similar to the one described in the attached files.

Actual results:
The deployment failed, with fewer OSDs deployed than requested.

Expected results:
The deployment passes successfully with all OSDs active.
It looks like puppet-ceph ran into a problem when trying to prepare the OSDs. I've updated this to DFG:Ceph and assigned it to gfidente for now.
TripleO reports the Overcloud deploy failed, but look at the numbers:
- 360 OSDs were requested
- 355 working OSDs were provided, a 98.6% success rate
It reports a failure because it didn't reach 100% success, but I suspect that the Ceph cluster and the rest of the Overcloud were still usable, just without all of the requested OSDs.
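The success rate quoted above follows directly from the counts in this report; a quick sketch of the arithmetic (the 360/355 figures come from the description):

```python
# OSD counts taken from this bug report.
requested_osds = 360  # 15 Ceph storage nodes x 24 OSDs each
deployed_osds = 355   # OSDs actually up after the failed deploy

success_rate = deployed_osds / requested_osds * 100
missing = requested_osds - deployed_osds

print(f"{success_rate:.1f}% of requested OSDs came up")  # 98.6%
print(f"{missing} OSDs missing")                          # 5 OSDs missing
```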
We should test whether the deployer can simply re-run the deploy command against the existing overcloud (the same way overcloud updates are done) and whether that brings the remaining OSDs up. If that doesn't work, then I think making it work would be the desired behavior.
*** This bug has been marked as a duplicate of bug 1445436 ***