Bug 1541152 - [RFE] Request for optional validation at the end of a ceph-ansible deploy
Summary: [RFE] Request for optional validation at the end of a ceph-ansible deploy
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 3.*
Assignee: Guillaume Abrioux
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Duplicates: 1499136 1547671 1628386
Depends On:
Blocks: 1578730
 
Reported: 2018-02-01 21:28 UTC by John Fulton
Modified: 2022-03-13 14:40 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-09 14:19:03 UTC
Embargoed:




Links
GitHub ceph/ceph-ansible pull 2599 (closed): Adds failure checking if OSD is not created. Last updated 2020-09-01 07:09:34 UTC
Launchpad bug 1721817. Last updated 2019-05-10 15:22:40 UTC

Description John Fulton 2018-02-01 21:28:16 UTC
If I use ceph-ansible to deploy Ceph and my configuration file specifies 36 OSDs or 7 pools to create, but the resulting deployment has fewer than 36 OSDs or fewer than 7 pools, should Ansible return an error?

It would be nice if the user could choose whether it does, or even set a desired success rate. For example, consider the deployment failed if fewer than 80% of my OSDs come up or if I don't get all of my storage pools.

Some users might want the tool to tell them so they can investigate. Other users, who run ceph-ansible as part of a larger tool, e.g. OpenStack director, might want to be informed but not have a long deployment fail for that reason, so they can fix the problem manually after the larger cloud is deployed.
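
For illustration, here is a rough sketch of what such an optional end-of-play check could look like as Ansible tasks run against a monitor. None of this exists in ceph-ansible today: the variables expected_osd_count, osd_success_ratio and expected_pools are invented for the example, and it assumes `ceph osd ls -f json` and `ceph osd pool ls -f json` return flat JSON lists of OSD IDs and pool names.

    # Hypothetical post-deploy validation tasks (run on a mon node).
    - name: count the OSDs the cluster actually knows about
      command: ceph osd ls -f json
      register: osd_ls
      changed_when: false

    - name: fail if the OSD count is below the desired success rate
      fail:
        msg: "Only {{ osd_ls.stdout | from_json | length }} of {{ expected_osd_count }} OSDs were created"
      when: (osd_ls.stdout | from_json | length) < ((expected_osd_count | int) * (osd_success_ratio | default(1.0) | float))

    - name: list the pools that were actually created
      command: ceph osd pool ls -f json
      register: pool_ls
      changed_when: false

    - name: fail unless every expected pool exists
      fail:
        msg: "Missing pools: {{ expected_pools | difference(pool_ls.stdout | from_json) }}"
      when: (expected_pools | difference(pool_ls.stdout | from_json)) | length > 0

Setting osd_success_ratio to 0.8 would reproduce the "80% of my OSDs" example above; leaving it unset would require every OSD, while the pool check stays all-or-nothing.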

Comment 3 John Fulton 2018-02-01 21:33:35 UTC
This issue is relevant to the following OpenStack director bug:

 https://bugzilla.redhat.com/show_bug.cgi?id=1539852

A problem that occurred 20 minutes into the deployment (Ceph is set up by ceph-ansible in step 2), where ceph-ansible set up only 1 of the 7 storage pools that OpenStack required, didn't manifest until nearly an hour into the deployment, when the gnocchi container, which depends on the metrics storage pool, failed to use it.

If ceph-ansible had these validations, director could enable them by default and fail the deployment unless all 7 storage pools were created. It could also surface a meaningful error message from the validation, e.g. "only 1 of 7 storage pools created", so the user could troubleshoot the Ceph installation (not the gnocchi installation, which is what happened here).

Comment 4 Ben England 2018-02-06 13:26:15 UTC
Part of the missing-OSD issue could be handled by rule-based OSD selection, assuming the introspection database knows that there are fewer than 36 drives.  It seems improbable that a drive would fail between introspection and deployment, but it is possible.

Suggestion: perhaps in ceph-ansible, make deployment of individual OSDs not return an error status, but add a separate rule that checks that some fraction of OSDs were successfully deployed, say at least 95%.  For a small cluster that is essentially the same as it is now.  For large clusters this would prevent the problem you describe, wouldn't it?  If people really want every single OSD to be present, make this fraction a settable parameter so they can set it to 100%.

Storage pool failures are far more serious, as an entire OpenStack/Ceph component then becomes unusable.  Therefore IMO all pools are required for successful deployment.
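
As a hedged sketch of that suggestion (the variable name osd_deploy_success_ratio is invented, not an existing ceph-ansible variable, and expected_osd_count is the hypothetical count from the sketch in the description), the fraction could live in the role defaults and feed an end-of-play assert:

    # Hypothetical entry in a defaults/group_vars file:
    # Fraction of configured OSDs that must exist for the play to succeed.
    # 0.95 tolerates a few failed drives on a large cluster; set it to 1.0
    # to require every single OSD.
    osd_deploy_success_ratio: 0.95

    # Hypothetical end-of-play assertion (pools remain all-or-nothing):
    - name: count the OSDs the cluster knows about
      command: ceph osd ls -f json
      register: osd_ls
      changed_when: false

    - name: verify enough OSDs were deployed
      assert:
        that:
          - (osd_ls.stdout | from_json | length) >= ((expected_osd_count | int) * (osd_deploy_success_ratio | float))
        msg: "Fewer than {{ osd_deploy_success_ratio * 100 }}% of the configured OSDs were created"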

Comment 6 Giulio Fidente 2018-02-22 10:09:39 UTC
*** Bug 1499136 has been marked as a duplicate of this bug. ***

Comment 7 Giulio Fidente 2018-02-22 10:10:20 UTC
*** Bug 1547671 has been marked as a duplicate of this bug. ***

Comment 9 Tim Rozet 2018-05-16 18:06:10 UTC
Submitted a PR for this:
https://github.com/ceph/ceph-ansible/pull/2599

Comment 10 Giulio Fidente 2018-09-19 12:20:01 UTC
*** Bug 1628386 has been marked as a duplicate of this bug. ***

Comment 14 John Fulton 2019-01-09 14:19:03 UTC
Talked to Seb; we do have a report at the end. If PM is pushing for this then it's possible, but it was deemed not a priority.

Comment 15 John Fulton 2019-05-10 15:01:40 UTC
A variation of https://review.opendev.org/#/c/657175, but for OSD percentage, could be used to do this.

Comment 16 John Fulton 2019-06-19 18:59:32 UTC
We now have a check on the TripleO side to confirm that at least 2/3 of the OSDs are running before continuing the deployment. That threshold can also be set to a different percentage: https://review.opendev.org/#/c/657175/. It should be available in OSP16.
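
If memory serves, the tripleo-heat-templates parameter introduced by that review is CephOsdPercentageMin; treat the exact name as an assumption and confirm it against the review before relying on it. Overriding the default would then look roughly like this in a custom environment file:

    # Hypothetical environment file passed with `openstack overcloud deploy -e ...`
    parameter_defaults:
      # Require 90% of the OSDs to be running before the deployment continues
      # (the default corresponds to the roughly 2/3 mentioned above).
      CephOsdPercentageMin: 90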

