If I use ceph-ansible to deploy Ceph and my configuration file specifies 36 OSDs or 7 pools, and the resulting deployment has fewer than 36 OSDs or fewer than 7 pools, should ansible return an error? It would be nice if the user could choose whether it does, or even set a desired success rate; for example, consider the deployment failed if I get less than 80% of my OSDs or if I don't get all of my storage pools. Some users might want the tool to tell them so they can investigate. Other users who run ceph-ansible as part of a larger tool, e.g. OpenStack director, might want to know but not have a long deployment fail for that reason, so that they could fix it manually after the larger cloud is deployed.
This issue is relevant to the following OpenStack director bug: https://bugzilla.redhat.com/show_bug.cgi?id=1539852. A problem that occurred 20 minutes into the deployment (Ceph is set up by ceph-ansible in step 2), where ceph-ansible set up only 1 of the 7 storage pools that OpenStack required, did not manifest until nearly an hour into the deployment, when the gnocchi container, which depends on the metrics storage pool, failed to use it. If ceph-ansible had these validations, director could enable them by default and fail the deployment unless all 7 storage pools were created. Director would also get a meaningful error message from the validation, e.g. "only 1 of 7 storage pools created", so the user could troubleshoot the Ceph installation (rather than the gnocchi installation, which is what happened).
Part of the missing-OSD issue could be handled by rule-based OSD selection, assuming the introspection database knows that there are fewer than 36 drives. It seems improbable that a drive would fail between introspection and deployment, but it is possible. Suggestion: perhaps in ceph-ansible, make the deployment of individual OSDs not return an error status, but add a separate rule that checks that some fraction of the OSDs were successfully deployed, say at least 95%. For a small cluster that is essentially the same as the current behavior; for large clusters it would prevent the problem you describe, wouldn't it? If people really want every single OSD to be present, make this fraction a settable parameter and they can set it to 100%. Storage pool failures are far more serious, since an entire OpenStack/Ceph component then becomes unusable, so IMO all pools should be required for a successful deployment.
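To make the suggestion concrete, here is a minimal sketch of such a validation rule in Python. All names (`validate_deployment`, `osd_success_ratio`, etc.) are illustrative, not actual ceph-ansible variables or tasks; the point is just the asymmetry: OSDs are judged against a configurable ratio, while pools are all-or-nothing.

```python
def validate_deployment(running_osds, expected_osds, created_pools,
                        expected_pools, osd_success_ratio=0.95):
    """Return a list of validation errors; an empty list means success.

    OSDs pass if at least osd_success_ratio of the expected count came up
    (settable, e.g. 1.0 to require every single OSD). Pools are required
    in full, since a missing pool makes a whole component unusable.
    """
    errors = []
    if expected_osds and running_osds / expected_osds < osd_success_ratio:
        errors.append("only %d of %d OSDs deployed (< %.0f%%)"
                      % (running_osds, expected_osds, osd_success_ratio * 100))
    missing = set(expected_pools) - set(created_pools)
    if missing:
        errors.append("only %d of %d storage pools created; missing: %s"
                      % (len(created_pools), len(expected_pools),
                         ", ".join(sorted(missing))))
    return errors
```

A playbook could run a rule like this at the end of the deployment and fail with the collected messages, which would have surfaced "only 1 of 7 storage pools created" directly instead of a later gnocchi failure.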
*** Bug 1499136 has been marked as a duplicate of this bug. ***
*** Bug 1547671 has been marked as a duplicate of this bug. ***
Submitted a PR for this: https://github.com/ceph/ceph-ansible/pull/2599
*** Bug 1628386 has been marked as a duplicate of this bug. ***
Talked to Seb: we do have a report at the end. If PM is pushing for this then it's possible, but it was deemed not a priority.
A variation of https://review.opendev.org/#/c/657175 but for OSD percentage could be used to do this.
We now have a check on the TripleO side that confirms at least 2/3 of the OSDs are running before the deployment continues. That threshold can also be set to a different percentage: https://review.opendev.org/#/c/657175/. It should be available in OSP16.