Created attachment 1237770
all of the logs, templates and deployment command
Description of problem:
A deployment of 3 controller nodes, 3 compute nodes and 15 Ceph storage nodes. Each Ceph storage node should have 24 OSDs running.
The total OSD count should be 360, but the deployment ended in failure with only 355 OSDs.
The failure logs are:
deploy_stderr: |
...
chown -h ceph:ceph /dev/sda
fi
fi
ceph-disk prepare --cluster-uuid 7c12ae5a-c871-11e6-9b00-b8ca3a66e37c /dev/sda /dev/nvme0n1
udevadm settle
Version-Release number of selected component (if applicable):
openstack-tripleo-puppet-elements-5.1.0-2.el7ost.noarch
openstack-tripleo-ui-1.0.5-3.el7ost.noarch
openstack-tripleo-image-elements-5.1.0-1.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
python-tripleoclient-5.4.0-2.el7ost.noarch
puppet-tripleo-5.4.0-3.el7ost.noarch
openstack-tripleo-common-5.4.0-3.el7ost.noarch
openstack-tripleo-validations-5.1.0-5.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
How reproducible:
25%
Steps to Reproduce:
1. Deploy the overcloud in an environment similar to the one described in the attached files
Actual results:
The deployment failed, with fewer OSDs deployed than requested.
Expected results:
The deployment completes successfully with all OSDs active.
Additional info:
TripleO reports the Overcloud deploy failed, but look at the numbers:
- 360 OSDs were requested
- 355 OSDs that work were provided, a 98.6% success rate
It reports a failure because it didn't have 100% success, but I suspect that the Ceph cluster and the rest of the Overcloud are still usable, just not with all available OSDs.
We should test whether the deployer can simply re-run the deploy command on the existing overcloud (the way they do overcloud updates) and whether that gets all of the OSDs working. If that doesn't work, then I think the desired behavior would be for the deployer to be able to do this.
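For reference, the numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic from this report; the variable names are made up for illustration and nothing here queries a real cluster.

```python
# Figures taken from this report; the variable names are illustrative only.
ceph_nodes = 15       # Ceph storage nodes in the deployment
osds_per_node = 24    # OSDs expected on each storage node
osds_active = 355     # OSDs that actually came up

osds_requested = ceph_nodes * osds_per_node       # total OSDs requested
osds_missing = osds_requested - osds_active       # OSDs that failed to deploy
success_rate = round(100 * osds_active / osds_requested, 1)  # percent, one decimal

print(osds_requested, osds_missing, success_rate)
```

So the deploy is reported as failed because 5 OSDs out of 360 are missing.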