Bug 1541415 - [RFE] Support multiple OSDs per NVMe SSD
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Volume
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 3.2
Assignee: Andrew Schoen
QA Contact: Vasishta
Docs Contact: John Brier
URL:
Whiteboard:
Duplicates: 1588085
Depends On:
Blocks: 1629656 1572368 1594251
Reported: 2018-02-02 14:10 UTC by John Fulton
Modified: 2019-05-06 15:30 UTC

Fixed In Version: RHEL: ceph-ansible-3.2.0-0.1.rc1.el7cp, ceph-12.2.8-23.el7cp; Ubuntu: ceph-ansible_3.2.0~rc1-2redhat1, ceph_12.2.8-21redhat1
Doc Type: Enhancement
Doc Text:
.Specifying more than one OSD per device is now possible
With this version, a new `batch` subcommand has been added. The `batch` subcommand includes the `--osds-per-device` option, which allows specifying multiple OSDs per device. This is especially useful with high-speed devices, such as Non-volatile Memory Express (NVMe) devices.
Clone Of:
Environment:
Last Closed: 2019-01-03 19:01:20 UTC
Target Upstream Version:


Links
System                  ID                             Priority  Status  Summary  Last Updated
Github                  ceph ceph-ansible issues 2126  None      None    None     2018-04-25 17:24:48 UTC
Github                  ceph ceph-ansible pull 2425    None      None    None     2018-04-19 09:15:45 UTC
Github                  ceph ceph-ansible pull 3111    None      None    None     2018-09-12 15:18:18 UTC
Github                  ceph ceph-ansible pull 3269    None      None    None     2018-10-26 14:33:11 UTC
Github                  ceph ceph pull 24060           None      None    None     2018-09-12 15:18:57 UTC
Github                  ceph ceph pull 24587           None      None    None     2018-10-26 12:40:21 UTC
Red Hat Product Errata  RHBA-2019:0020                 None      None    None     2019-01-03 19:01:49 UTC

Description John Fulton 2018-02-02 14:10:11 UTC
When Ceph runs on NVMe SSD OSDs, it needs multiple OSDs per NVMe SSD device to fully utilize the device, as stated in the "NVMe SSD partitioning" section of this Ceph documentation page [1]. However, ceph-ansible's normal osd_scenarios, "collocated" and "non-collocated", do not currently support this: they expect "devices" to point to an entire block device, not a partition of a device. This is a downstream Bugzilla requesting, for the product, the feature tracked in upstream GitHub issue 2126 [2].

[1] http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

[2] https://github.com/ceph/ceph-ansible/issues/2126

Comment 3 Ben England 2018-02-02 15:42:42 UTC
Here is a performance example of why this is important. Note that performance with 2 OSDs/NVM is almost double that with 1 OSD/NVM.

https://mojo.redhat.com/groups/product-performance-scale-community-of-practice/blog/2018/01/31/bluestore-on-all-nvm-rados-bench#jive_content_id_OSDs_per_NVM_SSD

Note that I used ceph-volume, upstream ceph-ansible and Luminous 12.2.2 to achieve this result, not RHCS 3.0.

Comment 7 Alfredo Deza 2018-03-06 19:43:24 UTC
Technically, ceph-volume does this already, provided that the logical volumes are created beforehand and then specified for ceph-ansible to consume.

Comment 8 Alfredo Deza 2018-03-15 18:58:44 UTC
Commenting again, this is plain impossible with ceph-disk, and ceph-volume is fully capable of handling it, given the LVs are made. Can we close this? Or can we get clarification on what else is needed here?

Comment 9 leseb 2018-03-15 19:04:22 UTC
I second what Alfredo is saying, if this is something supported out of the box by ceph-volume then there is no need for an RFE and this should be closed. We just need to make sure osp/ooo adds the support for ceph-volume.

Comment 10 Federico Lucifredi 2018-04-03 19:26:24 UTC
I think it is fine to require ceph-volume for NVMe support.

This has documentation impact and an OSP impact, so not closing the bug (but someone else should feel free to slice and dice into two if that helps).

Comment 11 Ben England 2018-04-06 13:45:08 UTC
Federico, there is still a slight functionality gap.  ceph-volume supports LVM and therefore ceph-ansible would be fine, except that nothing will construct the LVM volumes that are fed to ceph-ansible at present.  Not hard to do, but where does this happen automatically during OOO deployment?

Comment 12 John Fulton 2018-04-09 12:52:39 UTC
What if ceph-ansible had a new feature which took parameters like this:

osd_lvm_count: 2

physical_volumes:
- /dev/nvme0n1
- /dev/nvme2n1
...

and then made "osd_lvm_count" LVs on each PV and then used those LVs as if they had originally been passed under the "devices" list?
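As a rough illustration of the proposal above, the expansion from "osd_lvm_count" and "physical_volumes" to a flat device list could look like the sketch below. The parameter names come from this comment; the function names and the LV naming scheme are hypothetical, and nothing here is existing ceph-ansible behaviour.

```python
def plan_logical_volumes(physical_volumes, osd_lvm_count):
    """Map each PV to the LV names that would be carved out of it."""
    if osd_lvm_count < 1:
        raise ValueError("osd_lvm_count must be at least 1")
    plan = {}
    for pv in physical_volumes:
        # /dev/nvme0n1 -> nvme0n1; the "<disk>-osd-lv<N>" scheme is an assumption
        base = pv.rsplit("/", 1)[-1]
        plan[pv] = [f"{base}-osd-lv{i}" for i in range(1, osd_lvm_count + 1)]
    return plan


def flatten_devices(plan):
    """Flatten the plan into the list that would stand in for `devices`."""
    return [lv for lvs in plan.values() for lv in lvs]


plan = plan_logical_volumes(["/dev/nvme0n1", "/dev/nvme2n1"], osd_lvm_count=2)
devices = flatten_devices(plan)
print(devices)
```

With two PVs and osd_lvm_count set to 2, the flattened list holds four entries, two per PV, in the order the PVs were given.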

Comment 13 John Fulton 2018-04-09 13:26:07 UTC
Proposal:

1. update docs so user manually does what is in comment #12
2. if some future version of ceph-ansible gets the feature from comment #12, then update docs to use feature instead

We might be able to provide a preboot script along with the docs update which would set up the LVs during step 1 (we used to use a script like that to clean the disks during deployment -- ironic cleans them now).

What do you think of comment #12, Seb?

Comment 14 John Fulton 2018-04-09 13:36:45 UTC
(In reply to leseb from comment #9)
> I second what Alfredo is saying, if this is something supported out of the
> box by ceph-volume then there is no need for an RFE and this should be
> closed. We just need to make sure osp/ooo adds the support for ceph-volume.

Yes, osp/ooo can ship the appropriate ceph versions which include ceph-volume. However, as per Ben's comment #11, shouldn't ceph-ansible set up the LVs?

Suppose OSP were not in the picture and you have customers deploying Ceph in a new environment. Are you going to require as a prerequisite that they have the logical volumes already created on their systems? If so, they might configure those LVs inconsistently across their systems or not configure them optimally (e.g. too many LVs per PV). These are the types of problems which lead to the need for deployment tools which ensure it's done correctly in every deployment. For this reason, I am asking if what's in comment #12, or something like it, could be a future feature of ceph-ansible.

Comment 15 leseb 2018-04-19 09:15:46 UTC
Work for this is in progress and will be solved by the introduction of choose_disk + the ability to create the required PV/VG/LV.

The subject of the PR has changed a bit, since we are now talking more about a set of pre-tasks that will create the PV/VG/LV.

Comment 18 Christina Meno 2018-06-04 15:47:56 UTC
Harish, it's in 3.*. Would you please say where you are proposing I put it?

Comment 19 Harish NV Rao 2018-06-05 06:57:43 UTC
3.2, if it's going to be fixed there.

Comment 23 Randy Martinez 2018-06-12 01:21:01 UTC
John,

I don't think this is the case anymore. I've validated ceph-disk can in fact support non-collocated scenario w/NVMEs. Update the osds.yml to reflect the following:

osd_scenario: non-collocated
devices:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdd
dedicated_devices:
  - /dev/nvme0n1
  - /dev/nvme0n1
  - /dev/nvme0n1

New partitions on nvme0n1 will be added automatically in line with journal_size configured.
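To make the effect of this configuration concrete, the sketch below counts how many journals land on each dedicated device and how much space they would need. It assumes that "devices" and "dedicated_devices" pair up by position; the function name and the 5120 MB journal_size are example values only, not taken from this report.

```python
from collections import Counter


def required_journal_space_mb(dedicated_devices, journal_size_mb):
    """Sum the journal space each dedicated device would have to provide."""
    counts = Counter(dedicated_devices)
    return {dev: n * journal_size_mb for dev, n in counts.items()}


devices = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]
dedicated_devices = ["/dev/nvme0n1", "/dev/nvme0n1", "/dev/nvme0n1"]

# Assumed pairing: the i-th data device journals on the i-th dedicated device.
pairs = list(zip(devices, dedicated_devices))

# 5120 MB is an example journal_size, not a value from this bug.
space = required_journal_space_mb(dedicated_devices, journal_size_mb=5120)
print(pairs)
print(space)
```

Here the NVMe device only hosts journals for three OSDs on other disks; it does not itself carry multiple OSDs.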

Comment 25 Federico Lucifredi 2018-06-18 13:49:55 UTC
Outcome of triage is that this is a difficult objective for 3.0Z5.

Comment 26 Mike Hackett 2018-06-18 14:15:43 UTC
*** Bug 1588085 has been marked as a duplicate of this bug. ***

Comment 27 John Fulton 2018-07-17 12:39:37 UTC
(In reply to Randy Martinez from comment #23)
> John,
> 
> I don't think this is the case anymore. I've validated ceph-disk can in fact
> support non-collocated scenario w/NVMEs. Update the osds.yml to reflect the
> following:
> 
> osd_scenario: non-collocated
> devices:
>   - /dev/sdb
>   - /dev/sdc
>   - /dev/sdd
> dedicated_devices:
>   - /dev/nvme0n1
>   - /dev/nvme0n1
>   - /dev/nvme0n1
> 
> New partitions on nvme0n1 will be added automatically in line with
> journal_size configured.

Hi Randy,

That works, but it's for something else. That's not what this bug is about. This bug is about passing a PV like /dev/nvme0n1 and a number, e.g. 4, then having ceph-ansible create 4 LVs on that PV and use those LVs as devices, as if I had created them myself and then passed this:

devices:
  - lv1
  - lv2
  - lv3
  - lv4

For info on why you would do this, see http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning
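Done by hand, the split described above comes down to a handful of LVM commands. The sketch below only builds those command lines and does not run them; the VG name, the lv1..lvN names, and the equal-share sizing are assumptions made for illustration.

```python
def lvm_commands(pv, vg_name, lv_count):
    """Build the LVM command lines that split one PV into lv_count LVs."""
    if lv_count < 1:
        raise ValueError("lv_count must be at least 1")
    share = 100 // lv_count  # lvcreate only accepts whole percentages
    commands = [["pvcreate", pv], ["vgcreate", vg_name, pv]]
    for i in range(1, lv_count + 1):
        # The last LV takes whatever is left, so rounding never strands space.
        extents = f"{share}%VG" if i < lv_count else "100%FREE"
        commands.append(["lvcreate", "-l", extents, "-n", f"lv{i}", vg_name])
    return commands


commands = lvm_commands("/dev/nvme0n1", "vg_nvme0n1", lv_count=4)
for argv in commands:
    print(" ".join(argv))
```

Returning argument lists rather than running anything keeps the sketch safe to try on a machine with no spare NVMe device.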

Comment 31 Ken Dreyer (Red Hat) 2018-09-28 18:24:03 UTC
What other changes must happen for ceph-ansible to resolve this BZ beyond the ceph-ansible v3.2.0beta2 release?

Comment 32 leseb 2018-10-01 09:56:28 UTC
None, but I'll let Andrew confirm since he recently added the support for "batch" in ceph-ansible. Thanks.

Comment 37 leseb 2018-10-19 08:11:35 UTC
Thanks Bara, the only thing to mention is that this is not supported by the containerized deployment. Thanks.

Comment 39 leseb 2018-10-22 14:41:05 UTC
I'm actually assigning this to Andrew since he did the implementation and I'll let him answer your question as well John.
Thanks.

Comment 40 Ben England 2018-10-22 17:59:08 UTC
I don't follow why containerized Ceph should be different, other than that it's site-docker.yml vs site.yml.

Comment 41 leseb 2018-10-23 08:42:15 UTC
Ben, it is different because batch does not support prepare only, see: http://tracker.ceph.com/issues/36363

Comment 43 leseb 2018-10-24 12:47:38 UTC
John, please do not forget to mention that this does not support containerized deployments. Thanks

Comment 49 John Fulton 2018-10-26 12:12:45 UTC
(In reply to leseb from comment #41)
> Ben, it is different because batch does not support prepare only, see:
> http://tracker.ceph.com/issues/36363

So support for this feature in containers depends on the above issue being completed.

As far as you know, when it is completed will that be sufficient and this feature will be supported with containers?

If so will that be tracked in a different bug?

  John

Comment 50 Andrew Schoen 2018-10-26 12:40:22 UTC
(In reply to John Fulton from comment #49)
> (In reply to leseb from comment #41)
> > Ben, it is different because batch does not support prepare only, see:
> > http://tracker.ceph.com/issues/36363
> 
> So support for this feature in containers depends on the above issue being
> completed.
> 
> As far as you know, when it is completed will that be sufficient and this
> feature will be supported with containers?
> 
> If so will that be tracked in a different bug?
> 
>   John

The 'ceph-volume lvm batch --prepare' feature is completed upstream, merged to master and currently being backported to luminous. Once it's merged to master we'll get it cherry-picked downstream.
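
For reference, a minimal sketch of how a wrapper might assemble such an invocation. It uses only the subcommand and options named in this report (`lvm batch`, `--osds-per-device`, `--prepare`); the helper name is hypothetical, and the command line is built but not executed.

```python
def batch_command(devices, osds_per_device, prepare_only=False):
    """Build a `ceph-volume lvm batch` command line as an argument list."""
    if not devices:
        raise ValueError("at least one device is required")
    if osds_per_device < 1:
        raise ValueError("osds_per_device must be at least 1")
    argv = ["ceph-volume", "lvm", "batch",
            "--osds-per-device", str(osds_per_device)]
    if prepare_only:
        # Prepare the OSDs without activating them, per the --prepare work above.
        argv.append("--prepare")
    argv.extend(devices)
    return argv


argv = batch_command(["/dev/nvme0n1", "/dev/nvme2n1"], osds_per_device=2,
                     prepare_only=True)
print(" ".join(argv))
```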

Sebastian, when are you able to start on the container support for this?

Comment 51 Alfredo Deza 2018-10-26 13:42:13 UTC
Commits from https://github.com/ceph/ceph/pull/24587 have been pushed downstream

Comment 54 leseb 2018-10-26 14:33:11 UTC
Andrew, the patch is upstream, I just added it to the BZ.

Comment 56 Ken Dreyer (Red Hat) 2018-10-31 17:19:07 UTC
Seb added https://github.com/ceph/ceph-ansible/pull/3269 to this BZ, so I'm resetting Fixed In Version to ceph-ansible 3.2.0rc1.

Comment 60 Vasishta 2018-12-05 05:40:27 UTC
All planned testcases have been completed successfully, moving BZ to VERIFIED state.

Regards,
Vasishta Shastry
QE, Ceph

Comment 62 errata-xmlrpc 2019-01-03 19:01:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0020

