Created attachment 1237770 [details]
all of the logs, templates and deployment command
Description of problem:
A deployment of 3 controller nodes, 3 compute nodes, and 15 Ceph storage nodes. Each Ceph storage node should run 24 OSDs, for 360 OSDs overall. The deployment ended in failure, with only 355 OSDs deployed.
The failing commands from the logs are:
chown -h ceph:ceph /dev/sda
ceph-disk prepare --cluster-uuid 7c12ae5a-c871-11e6-9b00-b8ca3a66e37c /dev/sda /dev/nvme0n1
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy the overcloud on an environment similar to the one described in the attached files.

Actual results:
The deployment failed, with fewer OSDs deployed than requested.

Expected results:
The deployment passes successfully with all OSDs active.
It looks like puppet-ceph ran into a problem when trying to prepare the OSDs. I've updated this to DFG:Ceph and assigned it to gfidente for now.
TripleO reports the Overcloud deploy failed, but look at the numbers:
- 360 OSDs were requested
- 355 working OSDs were provided, a 98.6% success rate
It reports a failure because it didn't reach 100% success, but I suspect that the Ceph cluster and the rest of the Overcloud were still usable, just without all of the requested OSDs.
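The success rate quoted above follows directly from the counts in this report; a quick sketch of the arithmetic (the 360/355 figures come from the description):

```python
# OSD counts taken from this bug report.
requested_osds = 360  # 15 Ceph storage nodes x 24 OSDs each
deployed_osds = 355   # OSDs actually up after the failed deploy

success_rate = deployed_osds / requested_osds * 100
missing = requested_osds - deployed_osds

print(f"{success_rate:.1f}% of requested OSDs came up")  # 98.6%
print(f"{missing} OSDs missing")                          # 5 OSDs missing
```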
We should test whether the deployer can simply re-run the deploy command against the existing overcloud (the same way overcloud updates are done) and whether that brings the remaining OSDs up. If that doesn't work, then I think making it work would be the desired behavior.
*** This bug has been marked as a duplicate of bug 1445436 ***