If I use ceph-ansible to deploy Ceph and my configuration file specifies 36 OSDs or 7 pools, and the resulting deployment has fewer than 36 OSDs or fewer than 7 pools, should ansible return an error? It would be nice if the user could choose whether it does, or even set a desired success rate; for example, consider the deployment failed if I get less than 80% of my OSDs or if I don't get all of my storage pools. Some users might want the tool to tell them so they can investigate. Other users who run ceph-ansible as part of a larger tool, e.g. OpenStack director, might want to know but not have a long deployment fail for that reason, so that they could fix it manually after the larger cloud is deployed.
This issue is relevant to the following OpenStack director bug: https://bugzilla.redhat.com/show_bug.cgi?id=1539852. A problem that occurred 20 minutes into the deployment (Ceph is set up by ceph-ansible in step 2), where ceph-ansible set up only 1 of the 7 storage pools that OpenStack required, did not manifest until nearly an hour into the deployment, when the gnocchi container, which depends on the metrics storage pool, failed to use it. If ceph-ansible had these validations, director could enable them by default and fail the deployment unless all 7 storage pools were created. Director would also get a meaningful error message from the validation, e.g. "only 1 of 7 storage pools created", so the user could troubleshoot the Ceph installation (rather than the gnocchi installation, which is what happened).
Part of the missing-OSD issue could be handled by rule-based OSD selection, assuming the introspection database knows that there are fewer than 36 drives. It seems improbable that a drive would fail between introspection and deployment, but it is possible. Suggestion: perhaps in ceph-ansible, make the deployment of individual OSDs not return an error status, but add a separate rule that checks that some fraction of the OSDs were successfully deployed, say at least 95%. For a small cluster that is essentially the same as the current behavior; for large clusters it would prevent the problem you describe, wouldn't it? If people really want every single OSD to be present, make this fraction a settable parameter and they can set it to 100%. Storage pool failures are far more serious, since an entire OpenStack/Ceph component then becomes unusable, so IMO all pools should be required for a successful deployment.
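To make the suggestion concrete, here is a minimal sketch of such a validation rule in Python. All names (`validate_deployment`, `osd_success_ratio`, etc.) are illustrative, not actual ceph-ansible variables or tasks; the point is just the asymmetry: OSDs are judged against a configurable ratio, while pools are all-or-nothing.

```python
def validate_deployment(running_osds, expected_osds, created_pools,
                        expected_pools, osd_success_ratio=0.95):
    """Return a list of validation errors; an empty list means success.

    OSDs pass if at least osd_success_ratio of the expected count came up
    (settable, e.g. 1.0 to require every single OSD). Pools are required
    in full, since a missing pool makes a whole component unusable.
    """
    errors = []
    if expected_osds and running_osds / expected_osds < osd_success_ratio:
        errors.append("only %d of %d OSDs deployed (< %.0f%%)"
                      % (running_osds, expected_osds, osd_success_ratio * 100))
    missing = set(expected_pools) - set(created_pools)
    if missing:
        errors.append("only %d of %d storage pools created; missing: %s"
                      % (len(created_pools), len(expected_pools),
                         ", ".join(sorted(missing))))
    return errors
```

A playbook could run a rule like this at the end of the deployment and fail with the collected messages, which would have surfaced "only 1 of 7 storage pools created" directly instead of a later gnocchi failure.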
*** Bug 1499136 has been marked as a duplicate of this bug. ***
*** Bug 1547671 has been marked as a duplicate of this bug. ***
Submitted a PR for this: https://github.com/ceph/ceph-ansible/pull/2599
*** Bug 1628386 has been marked as a duplicate of this bug. ***
Talked to Seb: we do have a report at the end. If PM is pushing for this then it's possible, but it was deemed not a priority.
A variation of https://review.opendev.org/#/c/657175 but for OSD percentage could be used to do this.
We now have a check on the TripleO side that confirms at least 2/3 of the OSDs are running before the deployment continues. That threshold can also be set to a different percentage: https://review.opendev.org/#/c/657175/. It should be available in OSP16.