Description of problem:

At large scale it is difficult, tedious, repetitive, and sometimes impossible to specify Ceph OSDs by device name, and to specify an SSD journal for each OSD (bz 1438572 must be fixed first!). Installation issues can prevent scalability just as much as other implementation issues. This is difficult because:

- device names are not guaranteed to be stable across reboots
- at large scale, the number and type of devices may differ because hardware is purchased at different times
- at large scale, there is a high probability that one or more HDDs have failed and not been replaced, and the deployment needs to succeed anyway

These causes are discussed in the "Additional info" section below. Use of OpenStack director for an HCI (hyperconverged) deployment has made this problem more visible, because there is no option to use an external Ceph cluster (deployed via ceph-ansible) in that configuration.

Possible form of solution:

To support scalability of OpenStack-on-Ceph configurations, including HCI (hyperconverged) storage, OpenStack needs to support discovery of Ceph OSDs by a rule, using a device naming pattern (including /dev/disk/by-path/ names), drive size, or another attribute. The discovery process should convert the rule into a set of /dev/disk/by-path names and attributes for each node, and puppet-ceph should log the result as the deployment is performed. It is implied in this syntax that whatever drive is used for the operating system or for a Ceph SSD journal would not be used as a Ceph OSD. This benefits the installation by getting rid of most of the ceph YAML file and potential sources of user error. More importantly, nodes that deviate from the norm in either the number or the type of storage hardware can still be deployed. The bigger the deployment, the more installation complexity this feature removes.

Here is a thumbnail of how this could work (a sketch of the discovery step follows the "Additional info" items below). Instead of specifying "ceph::profile::params::osds:" in the ceph YAML file, we could specify a rule based on drive size:

  ceph::profile::params::osd_selection_rule:
    size: '4 TB'

or a pattern for the device name, such as:

  ceph::profile::params::osd_selection_rule:
    name: '/dev/sd*[a-z]'

or, if necessary (see additional info below):

  ceph::profile::params::osd_selection_rule:
    name: '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000000-lun-*'

Version-Release number of selected component (if applicable):

RHOSP 10 (Newton)

How reproducible:

The documentation below describes the current situation:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/red_hat_ceph_storage_for_the_overcloud/#Mapping_the_Ceph_Storage_Node_Disk_Layout

Actual results:

In the scale lab, using hardware from an OpenStack partner, SuperMicro ( https://www.supermicro.com/solutions/Cloud.cfm ), OpenStack repeatedly failed to deploy with only 8 cephstorage nodes, let alone the 30 nodes we intend to test. We worked around the problem by deploying with /dev/disk/by-path names, but to get it to deploy we had to cut our per-node drive count from 36 to 34 (because of failed drives in some of the hosts), wasting over 10 drives.

Expected results:

RHOSP OpenStack+Ceph deploys every time, successfully and easily, at scale, using all of the hardware made available to it.

Additional info:

We should look at ceph-ansible support for a discovery process like this, since there is talk of getting support for ceph-ansible into OpenStack.
device-name stability: OpenStack, unlike ceph-ansible, reboots nodes as part of the normal deployment process. Device names are not guaranteed by Linux to be stable across reboots, and they are not stable across reboots when there are two storage controllers (as we can demonstrate with SuperMicro 6048Rs in the scale lab). This can be worked around using /dev/disk/by-path specifications and the bug fix done by Joe Talerico et al in https://review.openstack.org/#/c/451826

non-identical hardware: If an OpenStack site is not purchased in its entirety at the same time with the exact same hardware specifications, then it may have different types of storage controllers in different slots and/or different numbers and types of drives. We don't want to place unreasonable restrictions on the homogeneity of cephstorage nodes within an OpenStack cluster.

disk failure probability: At a scale where a configuration contains on the order of 1000 HDDs, a few of them are going to fail; this is normal behavior given the drive MTBF for a cluster with that many drives. We don't want the OpenStack deploy for the entire node to fail because of this.
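To make the proposal above concrete, here is a minimal python sketch (not the actual puppet-ceph code) of how a node-local discovery step could expand a size/name rule into stable /dev/disk/by-path names. The rule keys and the helper name are assumptions, not an existing API:

#!/usr/bin/env python
# Rough sketch only (not the puppet-ceph implementation): expand an
# osd_selection_rule into stable /dev/disk/by-path names on one node.
import fnmatch
import os

def expand_osd_rule(name_glob=None, size_gb=None, system_disks=('sda',)):
    by_path = '/dev/disk/by-path'
    selected = []
    for link in sorted(os.listdir(by_path)):
        if '-part' in link:                        # skip partition symlinks
            continue
        full = os.path.join(by_path, link)
        kname = os.path.basename(os.path.realpath(full))   # e.g. 'sdb'
        if kname in system_disks:                  # never select the OS disk
            continue
        if name_glob and not fnmatch.fnmatch('/dev/' + kname, name_glob):
            continue
        if size_gb is not None:
            with open('/sys/block/%s/size' % kname) as f:
                gb = int(f.read()) * 512 // 10**9  # 512-byte sectors -> GB
            if gb != size_gb:                      # a real rule would allow a tolerance
                continue
        selected.append(full)
    return selected

# e.g. every 4 TB data drive except the system disk:
#   expand_osd_rule(name_glob='/dev/sd*[a-z]', size_gb=4000)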
I filed a related problem report requesting that introspection capture the /dev/disk/by-path name for each block device: https://bugs.launchpad.net/ironic/+bug/1679726

Device names can also be expressed as /dev/disk/by-id/wwn-<wwid> softlinks - the WWN is in introspection data today, but it is different for every block device, so OSD device names cannot be specified in a node-independent way. Use of /dev/disk/by-path or other softlinks depends on Joe Talerico's fix (now upstream) to https://bugs.launchpad.net/puppet-ceph/+bug/1677605

One alternative to this solution is that programs could be written to transform introspection data into deployment YAML files. The program, rather than a YAML syntax, then becomes the rule-based solution. The problem with this approach is that it is brittle: the YAML output by such a program rapidly becomes out of date and requires re-introspection and YAML regeneration, whereas a rule-based approach can cover a variety of situations without requiring changes to YAML to avoid deploy failures (for example, if disks are added or removed).
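For illustration, a small sketch (assuming the usual ironic-inspector data layout with "inventory"/"disks" and "root_disk" keys; field names can vary by release) of turning one node's saved introspection data into /dev/disk/by-id/wwn-<wwid> names. Note that the resulting list is unique to each node, which is exactly why a node-independent rule is preferable:

#!/usr/bin/env python
# Sketch only: map one node's saved introspection data to by-id softlinks.
# Assumed layout: {"root_disk": {...},
#                  "inventory": {"disks": [{"name": ..., "wwn": ...}, ...]}}
import json
import sys

def wwn_softlinks(saved_json_path):
    data = json.load(open(saved_json_path))
    root = data.get('root_disk', {}).get('name')
    links = []
    for disk in data['inventory']['disks']:
        if disk['name'] == root or not disk.get('wwn'):
            continue
        # Stable across reboots, but different on every node, so this list
        # cannot be shared across nodes the way a selection rule could be.
        links.append('/dev/disk/by-id/wwn-' + disk['wwn'])
    return links

if __name__ == '__main__':
    for link in wwn_softlinks(sys.argv[1]):   # one saved JSON file per node
        print(link)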
+1 on the idea of selecting OSD data disks using regexp/globbing, thanks!
Here is an ugly bash prototype script that takes introspection data and turns it into a .csv table listing node-uuid,device-name,device-wwn. It captures *all* eligible OSD drives, and in this respect it is better than what we have now, which assumes that every OSD host has the same number of drives. This output should be sufficient to generate the YAML used as input to a deployment.

It finds all the disks reported by introspection, then filters out the system disk and any disks that do not have the right size. All of this information is available from introspection data; it does not require /dev/disk/by-path names. It outputs the wwid identifier for each eligible drive, so that we can use /dev/disk/by-id names for the disks, which persist across reboots and avoid device-name instability problems.

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/openstack/introspect/generate-yaml.sh

I had to comment out the parts that used openstack commands so it would work with saved introspection data (I had no live system at that point) obtained from prior "openstack baremetal introspection data save" commands. This bash script relies heavily on the "jq" JSON-parsing utility. A native python implementation would be much cleaner.

-- output log --

[ben@bene-laptop introspect]$ INTROSPECT_DIR=logs bash generate-yaml.sh | tee generate-yaml.log
looking for device names of this form: sd*[a-z]
looking for devices with this size (GB): 1999
rejected disk sda because it is the system disk
0 OSD drives found in node 9d8526d4-84f9-4068-a1d1-a073bb9783c6
rejected disk sda because it is the system disk
0 OSD drives found in node ee4e17cb-1c5b-41cf-add2-5cc58fdb038f
...
rejected disk sda because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node 21e56a0a-d403-426e-aef9-a6c210dbb9c4
rejected disk sdak because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node beb01552-af9e-4781-97f7-3c51af7286fc
...
rejected disk sdak because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node ec4caa56-2786-422b-9664-4fb77ec7e474
---
echo 972 OSD drives stored in logs/osd_drives.csv
---
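For reference, here is a minimal python sketch of the same filtering logic. This is not the linked script; it assumes one saved introspection JSON file per node, named <node-uuid>.json under $INTROSPECT_DIR, with the usual "root_disk"/"inventory" layout (field names may differ by release):

#!/usr/bin/env python
# Sketch of the filtering described above: drop the system disk and
# wrong-sized disks, then emit node-uuid,device-name,device-wwn CSV rows.
import glob
import json
import os

INTROSPECT_DIR = os.environ.get('INTROSPECT_DIR', 'logs')
OSD_SIZE_GB = 1999                      # size filter, as in the log above
GB = 10 ** 9

rows = []
for path in sorted(glob.glob(os.path.join(INTROSPECT_DIR, '*.json'))):
    node_uuid = os.path.splitext(os.path.basename(path))[0]
    data = json.load(open(path))
    system_disk = data.get('root_disk', {}).get('name')
    found = 0
    for disk in data['inventory']['disks']:
        size_gb = disk['size'] // GB    # introspection reports bytes
        if disk['name'] == system_disk:
            print('rejected disk %s because it is the system disk' % disk['name'])
        elif size_gb != OSD_SIZE_GB:
            print('rejected disk %s because of size %d' % (disk['name'], size_gb))
        else:
            rows.append((node_uuid, disk['name'], disk.get('wwn') or ''))
            found += 1
    print('%d OSD drives found in node %s' % (found, node_uuid))

with open(os.path.join(INTROSPECT_DIR, 'osd_drives.csv'), 'w') as out:
    out.writelines(','.join(row) + '\n' for row in rows)
print('%d OSD drives stored in %s/osd_drives.csv' % (len(rows), INTROSPECT_DIR))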
This new python script supersedes the ugly shell hack above: https://github.com/bengland2/openstack-osd-discovery
Since ceph-ansible is now being used to deploy Ceph with OpenStack, would the implementation of this be in ceph-ansible or in TripleO (OOO)? In any case, Ironic has already done the introspection, so we should be able to determine which devices to use from that data, and ceph-ansible should be able to deploy OSDs on those devices if it is given the appropriate inputs in the inventory file.
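If the answer is ceph-ansible, the hand-off could be as simple as rendering the discovered devices into per-host vars. A rough sketch follows; the "devices" list is the variable consumed by the ceph-ansible OSD role, but the host_vars file layout and the device paths below are only placeholders for illustration:

#!/usr/bin/env python
# Sketch: write a per-host ceph-ansible vars file whose 'devices' list
# contains the OSD devices discovered for that host.
import os
import yaml

def write_host_devices(hostname, device_paths, host_vars_dir='host_vars'):
    if not os.path.isdir(host_vars_dir):
        os.makedirs(host_vars_dir)
    with open(os.path.join(host_vars_dir, hostname + '.yml'), 'w') as f:
        yaml.safe_dump({'devices': sorted(device_paths)}, f,
                       default_flow_style=False)

# hypothetical example: stable by-path names produced by a discovery step
write_host_devices('cephstorage-0', [
    '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000000-lun-0',
    '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000001-lun-0',
])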
- openstack RFE 1438572 depends on ceph 1438590
- ceph RFE 1438590 blocks openstack RFE 1438572
Registered blueprint https://blueprints.launchpad.net/tripleo/+spec/osds-by-rule
Correct, Ben. The issue was already attached to this BZ, btw :).
Untestable in the 3.2 timeframe, so targeting z1 with ceph-volume changes coming.
*** Bug 1486537 has been marked as a duplicate of this bug. ***
This is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1644611