Description of problem:

At large scale it is difficult, tedious, repetitive, and sometimes impossible to specify Ceph OSDs by device name, and to specify an SSD journal for each OSD (bz 1438572 must be fixed first!). Installation issues can prevent scalability just as much as other implementation issues. This is difficult because:

- device names are not guaranteed to be stable across reboots
- at large scale, the number and type of devices may differ because hardware is purchased at different times
- at large scale, there is a high probability that one or more HDDs have failed and not been replaced, and the deployment needs to succeed anyway

These causes are discussed in the "Additional info" section below. Use of OpenStack director for an HCI (hyperconverged) deployment has made this problem more visible, because there is no option to use an external Ceph cluster (deployed via ceph-ansible) in that configuration.

Possible form of solution:

To support scalability of OpenStack-on-Ceph configurations, including HCI (hyperconverged) storage, OpenStack needs to support discovery of Ceph OSDs by a rule, using a device naming pattern (including /dev/disk/by-path/ names), drive size, or another attribute. The discovery process should convert the rule into a set of /dev/disk/by-path names and attributes for each node, and puppet-ceph should log the result as the deployment is performed. It is implied in this syntax that whatever drive is used for the operating system or for a Ceph SSD journal would not be used as a Ceph OSD. This benefits the installation by getting rid of most of the ceph YAML file and potential sources of user error. More importantly, nodes that deviate from the norm in either the number or the type of storage hardware can still be deployed. The bigger the deployment, the more installation complexity this feature removes.

Here is a thumbnail of how this could work (a sketch of the discovery step follows the "Additional info" items below). Instead of specifying "ceph::profile::params::osds:" in the ceph YAML file, we could specify a rule based on drive size:

  ceph::profile::params::osd_selection_rule:
    size: '4 TB'

or a pattern for the device name, such as:

  ceph::profile::params::osd_selection_rule:
    name: '/dev/sd*[a-z]'

or, if necessary (see additional info below):

  ceph::profile::params::osd_selection_rule:
    name: '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000000-lun-*'

Version-Release number of selected component (if applicable):

RHOSP 10 (Newton)

How reproducible:

The documentation below describes the current situation:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/red_hat_ceph_storage_for_the_overcloud/#Mapping_the_Ceph_Storage_Node_Disk_Layout

Actual results:

In the scale lab, using hardware from an OpenStack partner, SuperMicro ( https://www.supermicro.com/solutions/Cloud.cfm ), OpenStack repeatedly failed to deploy with only 8 cephstorage nodes, let alone the 30 nodes we intend to test. We worked around the problem by deploying with /dev/disk/by-path names, but to get it to deploy we had to cut our per-node drive count from 36 to 34 (because of failed drives in some of the hosts), wasting over 10 drives.

Expected results:

RHOSP OpenStack+Ceph deploys every time, successfully and easily, at scale, using all of the hardware made available to it.

Additional info:

We should look at ceph-ansible support for a discovery process like this, since there is talk of getting support for ceph-ansible into OpenStack.
device-name stability: OpenStack, unlike ceph-ansible, reboots nodes as part of the normal deployment process. Device names are not guaranteed by Linux to be stable across reboots, and they are not stable across reboots when there are two storage controllers (as we can demonstrate with SuperMicro 6048Rs in the scale lab). This can be worked around using /dev/disk/by-path specifications and the bug fix done by Joe Talerico et al in https://review.openstack.org/#/c/451826

non-identical hardware: If an OpenStack site is not purchased in its entirety at the same time with the exact same hardware specifications, then it may have different types of storage controllers in different slots and/or different numbers and types of drives. We don't want to place unreasonable restrictions on the homogeneity of cephstorage nodes within an OpenStack cluster.

disk failure probability: At a scale where a configuration contains on the order of 1000 HDDs, a few of them are going to fail; this is normal behavior given the drive MTBF for a cluster with that many drives. We don't want the OpenStack deploy for the entire node to fail because of this.
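To make the proposal above concrete, here is a minimal python sketch (not the actual puppet-ceph code) of how a node-local discovery step could expand a size/name rule into stable /dev/disk/by-path names. The rule keys and the helper name are assumptions, not an existing API:

#!/usr/bin/env python
# Rough sketch only (not the puppet-ceph implementation): expand an
# osd_selection_rule into stable /dev/disk/by-path names on one node.
import fnmatch
import os

def expand_osd_rule(name_glob=None, size_gb=None, system_disks=('sda',)):
    by_path = '/dev/disk/by-path'
    selected = []
    for link in sorted(os.listdir(by_path)):
        if '-part' in link:                        # skip partition symlinks
            continue
        full = os.path.join(by_path, link)
        kname = os.path.basename(os.path.realpath(full))   # e.g. 'sdb'
        if kname in system_disks:                  # never select the OS disk
            continue
        if name_glob and not fnmatch.fnmatch('/dev/' + kname, name_glob):
            continue
        if size_gb is not None:
            with open('/sys/block/%s/size' % kname) as f:
                gb = int(f.read()) * 512 // 10**9  # 512-byte sectors -> GB
            if gb != size_gb:                      # a real rule would allow a tolerance
                continue
        selected.append(full)
    return selected

# e.g. every 4 TB data drive except the system disk:
#   expand_osd_rule(name_glob='/dev/sd*[a-z]', size_gb=4000)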
I filed a related problem report requesting that introspection capture the /dev/disk/by-path name for each block device: https://bugs.launchpad.net/ironic/+bug/1679726

Device names can also be expressed as /dev/disk/by-id/wwn-<wwid> softlinks - the WWN is in introspection data today, but it is different for every block device, so OSD device names cannot be specified in a node-independent way. Use of /dev/disk/by-path or other softlinks depends on Joe Talerico's fix (now upstream) to https://bugs.launchpad.net/puppet-ceph/+bug/1677605

One alternative to this solution is that programs could be written to transform introspection data into deployment YAML files. The program, rather than a YAML syntax, then becomes the rule-based solution. The problem with this approach is that it is brittle: the YAML output by such a program rapidly becomes out of date and requires re-introspection and YAML regeneration, whereas a rule-based approach can cover a variety of situations without requiring changes to YAML to avoid deploy failures (for example, if disks are added or removed).
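For illustration, a small sketch (assuming the usual ironic-inspector data layout with "inventory"/"disks" and "root_disk" keys; field names can vary by release) of turning one node's saved introspection data into /dev/disk/by-id/wwn-<wwid> names. Note that the resulting list is unique to each node, which is exactly why a node-independent rule is preferable:

#!/usr/bin/env python
# Sketch only: map one node's saved introspection data to by-id softlinks.
# Assumed layout: {"root_disk": {...},
#                  "inventory": {"disks": [{"name": ..., "wwn": ...}, ...]}}
import json
import sys

def wwn_softlinks(saved_json_path):
    data = json.load(open(saved_json_path))
    root = data.get('root_disk', {}).get('name')
    links = []
    for disk in data['inventory']['disks']:
        if disk['name'] == root or not disk.get('wwn'):
            continue
        # Stable across reboots, but different on every node, so this list
        # cannot be shared across nodes the way a selection rule could be.
        links.append('/dev/disk/by-id/wwn-' + disk['wwn'])
    return links

if __name__ == '__main__':
    for link in wwn_softlinks(sys.argv[1]):   # one saved JSON file per node
        print(link)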
+1 on the idea of selecting OSD data disks using regexp/globbing, thanks!
Here is an ugly bash prototype script that takes introspection data and turns it into a .csv table listing node-uuid,device-name,device-wwn. It captures *all* eligible OSD drives, and in this respect it is better than what we have now, which assumes that every OSD host has the same number of drives. This output should be sufficient to generate the YAML used as input to a deployment.

It finds all the disks reported by introspection, then filters out the system disk and any disks that do not have the right size. All of this information is available from introspection data; it does not require /dev/disk/by-path names. It outputs the wwid identifier for each eligible drive, so that we can use /dev/disk/by-id names for the disks, which persist across reboots and avoid device-name instability problems.

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/openstack/introspect/generate-yaml.sh

I had to comment out the parts that used openstack commands so it would work with saved introspection data (I had no live system at that point) obtained from prior "openstack baremetal introspection data save" commands. This bash script relies heavily on the "jq" JSON-parsing utility. A native python implementation would be much cleaner.

-- output log --

[ben@bene-laptop introspect]$ INTROSPECT_DIR=logs bash generate-yaml.sh | tee generate-yaml.log
looking for device names of this form: sd*[a-z]
looking for devices with this size (GB): 1999
rejected disk sda because it is the system disk
0 OSD drives found in node 9d8526d4-84f9-4068-a1d1-a073bb9783c6
rejected disk sda because it is the system disk
0 OSD drives found in node ee4e17cb-1c5b-41cf-add2-5cc58fdb038f
...
rejected disk sda because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node 21e56a0a-d403-426e-aef9-a6c210dbb9c4
rejected disk sdak because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node beb01552-af9e-4781-97f7-3c51af7286fc
...
rejected disk sdak because it is the system disk
rejected disk sdal because of size 500
36 OSD drives found in node ec4caa56-2786-422b-9664-4fb77ec7e474
---
echo 972 OSD drives stored in logs/osd_drives.csv
---
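For reference, here is a minimal python sketch of the same filtering logic. This is not the linked script; it assumes one saved introspection JSON file per node, named <node-uuid>.json under $INTROSPECT_DIR, with the usual "root_disk"/"inventory" layout (field names may differ by release):

#!/usr/bin/env python
# Sketch of the filtering described above: drop the system disk and
# wrong-sized disks, then emit node-uuid,device-name,device-wwn CSV rows.
import glob
import json
import os

INTROSPECT_DIR = os.environ.get('INTROSPECT_DIR', 'logs')
OSD_SIZE_GB = 1999                      # size filter, as in the log above
GB = 10 ** 9

rows = []
for path in sorted(glob.glob(os.path.join(INTROSPECT_DIR, '*.json'))):
    node_uuid = os.path.splitext(os.path.basename(path))[0]
    data = json.load(open(path))
    system_disk = data.get('root_disk', {}).get('name')
    found = 0
    for disk in data['inventory']['disks']:
        size_gb = disk['size'] // GB    # introspection reports bytes
        if disk['name'] == system_disk:
            print('rejected disk %s because it is the system disk' % disk['name'])
        elif size_gb != OSD_SIZE_GB:
            print('rejected disk %s because of size %d' % (disk['name'], size_gb))
        else:
            rows.append((node_uuid, disk['name'], disk.get('wwn') or ''))
            found += 1
    print('%d OSD drives found in node %s' % (found, node_uuid))

with open(os.path.join(INTROSPECT_DIR, 'osd_drives.csv'), 'w') as out:
    out.writelines(','.join(row) + '\n' for row in rows)
print('%d OSD drives stored in %s/osd_drives.csv' % (len(rows), INTROSPECT_DIR))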
This new python script supersedes the ugly shell hack above: https://github.com/bengland2/openstack-osd-discovery
Since ceph-ansible is now being used to deploy Ceph with OpenStack, would the implementation of this be in ceph-ansible or in TripleO (OOO)? In any case, Ironic has already done the introspection, so we should be able to determine which devices to use from that data, and ceph-ansible should be able to deploy OSDs on those devices if it is given the appropriate inputs in the inventory file.
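If the answer is ceph-ansible, the hand-off could be as simple as rendering the discovered devices into per-host vars. A rough sketch follows; the "devices" list is the variable consumed by the ceph-ansible OSD role, but the host_vars file layout and the device paths below are only placeholders for illustration:

#!/usr/bin/env python
# Sketch: write a per-host ceph-ansible vars file whose 'devices' list
# contains the OSD devices discovered for that host.
import os
import yaml

def write_host_devices(hostname, device_paths, host_vars_dir='host_vars'):
    if not os.path.isdir(host_vars_dir):
        os.makedirs(host_vars_dir)
    with open(os.path.join(host_vars_dir, hostname + '.yml'), 'w') as f:
        yaml.safe_dump({'devices': sorted(device_paths)}, f,
                       default_flow_style=False)

# hypothetical example: stable by-path names produced by a discovery step
write_host_devices('cephstorage-0', [
    '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000000-lun-0',
    '/dev/disk/by-path/pci-0000:03:00.0-sas-0x1221000000000001-lun-0',
])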
- openstack RFE 1438572 depends on ceph 1438590
- ceph RFE 1438590 blocks openstack RFE 1438572
Registered blueprint https://blueprints.launchpad.net/tripleo/+spec/osds-by-rule
Correct, Ben. The issue was already attached to this BZ, btw :).
Untestable in the 3.2 timeframe, so targeting z1 with ceph-volume changes coming.
*** Bug 1486537 has been marked as a duplicate of this bug. ***
This is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1644611