Bug 1438572 - [RFE] Ceph OSD specification of SSD journals unnecessarily complex
Summary: [RFE] Ceph OSD specification of SSD journals unnecessarily complex
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph-ansible
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: Upstream M2
Target Release: ---
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On: 1438590
Blocks:
 
Reported: 2017-04-03 19:36 UTC by Ben England
Modified: 2019-01-25 09:00 UTC
CC: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-25 09:00:34 UTC
Target Upstream Version:
Embargoed:



Description Ben England 2017-04-03 19:36:58 UTC
Description of problem:

In a large-scale OpenStack-on-Ceph configuration, specifying the SSD journal device FOR EACH OSD is tedious and unnecessary, particularly on systems with many OSDs per host.  It should be possible for the sysadmin to either
- list all SSD devices in the system once
- or provide a rule for discovering all SSD devices in the system (e.g. a regular expression)
and have the puppet-ceph module spread the OSDs evenly across those SSDs, which is all that people typically do anyway when specifying the SSD device for each OSD.

Background: Almost all Ceph sites with HDD-backed OSDs use SSD journaling to accelerate writes.  Higher HDD densities per host, such as 36 drives/host, bring cost/TB down and IOPS/host up, so most sites want a high density of HDDs.

Version-Release number of selected component (if applicable):

OSP 10 (Newton)



Actual results:

current syntax:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/red_hat_ceph_storage_for_the_overcloud/#Mapping_the_Ceph_Storage_Node_Disk_Layout
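
For reference, the per-OSD mapping described in the documentation above looks roughly like this (device names are purely illustrative); every OSD data device needs its own "journal" entry:

ceph::profile::params::osds:
  '/dev/sdb':
    journal: '/dev/nvme0n1'
  '/dev/sdc':
    journal: '/dev/nvme0n1'
  '/dev/sdd':
    journal: '/dev/nvme1n1'
  '/dev/sde':
    journal: '/dev/nvme1n1'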

Expected results:

The sysadmin should not have to specify a "journal" property for each OSD, and should be able to specify the SSDs once, by enumeration, using syntax similar to this:

ceph::profile::params::ssd_journals:
- '/dev/nvme0n1'
- '/dev/nvme1n1'

or for rule-based selection:

ceph::profile::params::ssd_journal_rule:
  name: '/dev/nvme?n1'

using a set of criteria that are stable across reboots and across nodes, including size and block device pathname (possibly /dev/disk/by-path).
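
For example, an enumeration based on stable by-path names might look like this (pathnames below are purely illustrative):

ceph::profile::params::ssd_journals:
- '/dev/disk/by-path/pci-0000:85:00.0-nvme-1'
- '/dev/disk/by-path/pci-0000:86:00.0-nvme-1'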

Additional info:

A solution to this problem leads into the next simplification, which is to allow discovery of OSD devices by rule, important for scaling to really large configurations.  It is not possible to do this if the user must specify which SSD journal goes with which OSD HDD.
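
As a sketch of that follow-on idea (purely hypothetical syntax, mirroring the journal rule above), OSD data devices could likewise be selected by rule:

ceph::profile::params::osd_device_rule:
  name: '/dev/sd[b-z]'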

I will try to look at ceph-ansible and verify that it can do these things as well, since there is a possibility of moving to a ceph-ansible-based deploy tool(?).

Comment 1 Ben England 2017-04-17 12:10:09 UTC
It turns out that Ceph-ansible is going to be used in Pike so we need to focus on ceph-ansible's capabilities in this area.

Comment 2 John Fulton 2017-04-17 16:16:58 UTC
I've switched this from puppet-ceph to tripleo-heat-templates (THT) because the specification of OSDs and their journals is done in THT, and puppet-ceph processes the parameters it is passed from Heat. Per the spec for having ceph-ansible process those parameters in place of puppet-ceph, in Pike we will maintain parity with the existing THT [1]. Thus, I think we should continue to allow users to specify their OSDs and journals in the current format [2]. 

That is not to say this cannot be done however. One way to implement this RFE may be the following: 

1. Have an RFE for ceph-ansible to support the proposed rule syntax or something like it, e.g. "/dev/nvme?n1"

2. Introduce the proposed syntax for ceph::profile::params::ssd_journals as an alternative method to THT. 

3. Once ceph-ansible supports #1, then introduce the proposed syntax for ceph::profile::params::ssd_journal_rule as a second alternative method to THT. 

Items #2 and #3 above might be late additions for Pike, as the ceph-ansible integration with parity to the existing THT would need to come first in order to stay backwards compatible and pass CI. Note that the second item borrows directly from the ceph-ansible syntax [3]. 

All this is my long way of saying that I am open to this change, but I want to point out some details so we do not break backwards compatibility. More DFG:Ceph folks should weigh in on whether it will fit in Pike. 
 

[1] http://specs.openstack.org/openstack/tripleo-specs/specs/pike/tripleo-ceph-ansible-integration.html#avoid-duplication-of-effort

[2] https://github.com/RHsyseng/hci/blob/master/custom-templates/ceph.yaml#L17

[3] https://github.com/ceph/ceph-ansible/blob/master/group_vars/osds.yml.sample#L83-L87

Comment 3 Giulio Fidente 2017-04-18 10:11:02 UTC
The current implementation is meant to allow for selection of a specific and different journal device for every OSD.

If I understand correctly, this is an RFE to permit selection of the journal devices based on a regexp/globbing mechanism.

Ben, can we update the subject accordingly? Shall we implement a similar mechanism for filtering/collecting the OSD data devices too?

Comment 4 Ben England 2017-04-18 11:24:38 UTC
The regexp proposal was not the main point of the bz - the point was that you shouldn't have to manually specify which SSD goes with which OSD, 20 or 30 times.  So I think the subject is correct.  The rule proposal was intended to show how you could make it even simpler to specify your SSD journal devices, with 2 lines of YAML.  But it is not required.  Enumeration has certain advantages here - for example, do you want the deploy to be attempted if one of your NVM SSD journals is missing?

The last sentence on "similar mechanism for the OSD data devices" is addressed by a different bz 1438590.  However, I separated the two so it is possible to start making progress - 1438590 is blocked by this bz because you can't avoid enumeration of OSD data devices if you have to specify an SSD journal explicitly for each OSD.

In hindsight, there is another possible way to resolve this set of problems - transform introspection data into YAML using rules for block device assignment.  If you want to support this approach, please consider Ironic upstream problem report 

https://bugs.launchpad.net/ironic/+bug/1679726

Comment 5 Giulio Fidente 2017-04-18 12:00:58 UTC
hi Ben, +1 on the idea of allowing for enumeration of the journal devices but I'd consider it a new feature; I don't think that the existing mechanism is unnecessarily complex. It allows for at least two config scenarios which could not be resolved by enumerating the journal devices, for example:

1) mapping data devices to specific journal devices (not all journal devices are the same)

2) having, on a single storage node, some data devices which use external journals (fast, on SSD) and others which use colocated journals (slower)

Comment 6 Ben England 2017-04-18 13:52:09 UTC
Giulio, re comment 5:

1) 

I do not see this happening in a significant percentage of cases - almost always the journal devices are the same.  Let's take a poll (ask Arkady, for example), but from what I've seen this situation is the exception.  Do we have to design our user interface around the requirements of the least likely situation?  We can keep the syntax for specifying an SSD journal per OSD, but simply add syntax like I proposed as an option, so that when a list of SSDs is specified as above, you do not have to specify an SSD journal for each OSD and OpenStack will assign those SSD journals to OSDs in a round-robin fashion.
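
To illustrate (proposed syntax only; device names hypothetical): given

ceph::profile::params::ssd_journals:
- '/dev/nvme0n1'
- '/dev/nvme1n1'

and OSD data devices /dev/sdb through /dev/sde, the deploy tool would pair them round-robin - sdb->nvme0n1, sdc->nvme1n1, sdd->nvme0n1, sde->nvme1n1 - without the user having to write any of that out.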

Furthermore, if we do not alter the YAML to allow the user to avoid specifying the SSD journal per OSD, then we can do nothing about 1438590 (discover OSD devices by rule).  IMHO this is the most important change to permit scalability of OpenStack+Ceph.

2)

Now you are getting at a really interesting problem, which is the need to set up different Ceph storage pools with different QoS (Quality of Service).  I'm very interested in this feature.  For example, if you want an all-flash pool for low-latency storage, this requires altering the Ceph CRUSH map to segregate those devices so that other lower-QoS pools will not use them.  It would be great to be able to store CephFS metadata in an all-SSD pool, and similar things can be done with RGW as well to speed up metadata access. 

Is there anything in RHOSP today that supports using Ceph CRUSH maps to implement QoS?  AFAIK ceph-ansible does not support it yet.  This functional enhancement probably needs to be a separate bz.

Comment 7 Giulio Fidente 2017-04-18 14:51:36 UTC
hi Ben, I think we agree:

1) we can keep the existing mechanism but we should also implement an easier mechanism to ease the deployment when possible (which is the purpose of this BZ)

2) we should start thinking about the crushmap manipulation to ease the deployment of more complex configurations

Comment 8 Ben England 2017-07-18 15:18:55 UTC
You may not be able to keep the existing mechanism because ceph-ansible works differently from puppet-ceph.  In ceph-ansible, there are two variables that have to be defined:

- devices - list of block devices where OSD data goes
- raw_journal_devices - list of SSD journal devices

These two lists are defined so that each entry in "devices" has a corresponding entry in "raw_journal_devices" for it, and the journal device must be the entire block device, not a partition!  So there is no way to pass information into ceph-ansible about which partition to use.  This means starting in RHOSP 12 there is no way to use existing YAMLs that reference specific SSD partitions, am I right?
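
For example, a sketch of the style described above (device names are illustrative, and the exact variable names depend on the ceph-ansible version):

devices:
- /dev/sdb
- /dev/sdc
- /dev/sdd
raw_journal_devices:
- /dev/nvme0n1   # journal for /dev/sdb
- /dev/nvme0n1   # journal for /dev/sdc
- /dev/nvme1n1   # journal for /dev/sdd

ceph-ansible carves up each whole journal device itself, so there is no way to say "use partition 3 of /dev/nvme0n1".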

Comment 12 Sébastien Han 2018-01-19 08:39:08 UTC
We have a prototype we need to resurrect.
This is currently planned for 3.1, although I'm not sure if we are going to make it.

I see this is for 10; this should be re-targeted.
How much time do we have for this?

Comment 17 John Fulton 2018-01-25 18:48:10 UTC
Provided that ceph-ansible gets the feature described in 1438590 [1], this bug will track getting TripleO in Rocky to take advantage of the new feature. 

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1438590

Comment 19 John Fulton 2018-02-12 16:15:04 UTC
Registered blueprint https://blueprints.launchpad.net/tripleo/+spec/osds-by-rule

Comment 25 Ben England 2019-01-03 14:44:28 UTC
Should this bug be assigned in some way to the Ceph(-ansible) team?  I know the Orchestration sandwich project is working on topics like this.

Comment 27 Ben England 2019-01-24 12:22:52 UTC
Actually, I just used ceph-ansible with RHCS 3.2 and found that it is no longer difficult to configure SSD journals - for example, you just specify:

osd_scenario: lvm
osds_per_device: 4
devices:
- /dev/sdb
...
- /dev/sdp
- /dev/nvme0n1

And the rest is taken care of by ceph-volume lvm batch.  Am I missing something?  So I think you can close this bz.

Comment 28 Sébastien Han 2019-01-25 09:00:34 UTC
That's correct Ben. Based on https://bugzilla.redhat.com/show_bug.cgi?id=1438572#c27 I'm closing this as CURRENTRELEASE.
Thanks!

