1541415 – [RFE] Support multiple OSDs per NVMe SSD

Summary: [RFE] Support multiple OSDs per NVMe SSD

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Volume
Sub Component:
Version:	3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	3.2
Assignee:	Andrew Schoen
QA Contact:	Vasishta
Docs Contact:	John Brier
URL:
Whiteboard:
Duplicates (1):	1588085
Depends On:
Blocks:	1572368 1594251 1629656

Reported:	2018-02-02 14:10 UTC by John Fulton
Modified:	2019-05-06 15:30 UTC
CC List:	26 users
Fixed In Version:	RHEL: ceph-ansible-3.2.0-0.1.rc1.el7cp, ceph-12.2.8-23.el7cp Ubuntu: ceph-ansible_3.2.0~rc1-2redhat1, ceph_12.2.8-21redhat1
Doc Type:	Enhancement
Doc Text:	Specifying more than one OSD per device is now possible. With this version, a new `batch` subcommand has been added. The `batch` subcommand includes the `--osds-per-device` option, which allows specifying multiple OSDs per device. This is especially useful when using high-speed devices, such as Non-volatile Memory Express (NVMe).
Clone Of:
Environment:
Last Closed:	2019-01-03 19:01:20 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ceph ceph-ansible issues 2126	None	closed	multiple OSDs per NVM SSD	2020-06-26 12:58:09 UTC
Github	ceph ceph-ansible pull 2425	None	closed	Choose disk	2020-06-26 12:58:09 UTC
Github	ceph ceph-ansible pull 3111	None	closed	ceph_volume: adds the osds_per_device parameter	2020-06-26 12:58:09 UTC
Github	ceph ceph-ansible pull 3269	None	closed	ceph_volume: add container support for batch	2020-06-26 12:58:09 UTC
Github	ceph ceph pull 24060	None	closed	ceph-volume batch: allow --osds-per-device, default it to 1	2020-06-26 12:58:09 UTC
Github	ceph ceph pull 24587	None	closed	ceph-volume: adds a --prepare flag to `lvm batch`	2020-06-26 12:58:09 UTC
Red Hat Product Errata	RHBA-2019:0020	None	None	None	2019-01-03 19:01:49 UTC

Description John Fulton 2018-02-02 14:10:11 UTC

When Ceph runs on NVMe SSD OSDs, it needs multiple OSDs per NVMe SSD device to fully utilize the device, as stated in the "NVMe SSD partitioning" section of this Ceph documentation page [1]. However, ceph-ansible's normal osd_scenarios, "collocated" and "non-collocated", do not currently support this: they expect "devices" to point to an entire block device, not a partition of a device. This downstream Bugzilla requests, for the product, the feature tracked in upstream GitHub issue 2126 [2].

[1] http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

[2] https://github.com/ceph/ceph-ansible/issues/2126

Comment 3 Ben England 2018-02-02 15:42:42 UTC

Here is a performance example of why this is important. Note that performance with 2 OSDs/NVM is almost double that with 1 OSD/NVM.

https://mojo.redhat.com/groups/product-performance-scale-community-of-practice/blog/2018/01/31/bluestore-on-all-nvm-rados-bench#jive_content_id_OSDs_per_NVM_SSD

Note that I used ceph-volume, upstream ceph-ansible and Luminous 12.2.2 to achieve this result, not RHCS 3.0.

Comment 7 Alfredo Deza 2018-03-06 19:43:24 UTC

Technically, ceph-volume does this already, provided that the logical volumes are created beforehand and then specified for ceph-ansible to consume.

Comment 8 Alfredo Deza 2018-03-15 18:58:44 UTC

Commenting again, this is plain impossible with ceph-disk, and ceph-volume is fully capable of handling it, given the LVs are made. Can we close this? Or can we get clarification on what else is needed here?

Comment 9 Sébastien Han 2018-03-15 19:04:22 UTC

I second what Alfredo is saying, if this is something supported out of the box by ceph-volume then there is no need for an RFE and this should be closed. We just need to make sure osp/ooo adds the support for ceph-volume.

Comment 10 Federico Lucifredi 2018-04-03 19:26:24 UTC

I think it is fine to require ceph-volume for NVMe support.

This has documentation impact and an OSP impact, so not closing the bug (but someone else should feel free to slice and dice into two if that helps).

Comment 11 Ben England 2018-04-06 13:45:08 UTC

Federico, there is still a slight functionality gap.  ceph-volume supports LVM and therefore ceph-ansible would be fine, except that nothing will construct the LVM volumes that are fed to ceph-ansible at present.  Not hard to do, but where does this happen automatically during OOO deployment?

Comment 12 John Fulton 2018-04-09 12:52:39 UTC

What if ceph-ansible had a new feature which took parameters like this:

osd_lvm_count: 2

physical_volumes:
- /dev/nvme0n1
- /dev/nvme2n1
...

and then made "osd_lvm_count" LVs on each PV and then used those LVs as if they had originally been passed under the "devices" list?
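As a rough sketch of the idea above (not an existing ceph-ansible feature), the expansion could look like the following. The function name `plan_lvs` and the VG/LV naming scheme are hypothetical; the LVM commands are only returned as strings and never executed, and each LV is assumed to get an equal percentage of its VG.

```python
from typing import Dict, List


def plan_lvs(physical_volumes: List[str], osd_lvm_count: int) -> Dict[str, List[str]]:
    """Plan equal-sized LVs on each PV.

    Returns the LVM commands that would be needed (as strings, not executed)
    and the resulting "vg/lv" names that could stand in for "devices".
    The naming scheme below is hypothetical.
    """
    if osd_lvm_count < 1:
        raise ValueError("osd_lvm_count must be at least 1")

    # Equal share of the volume group per LV, rounded down to a whole percent.
    percent = 100 // osd_lvm_count
    commands: List[str] = []
    devices: List[str] = []
    for pv in physical_volumes:
        # Derive a VG name from the device name, e.g. ceph-nvme0n1 (assumption).
        vg = "ceph-" + pv.rsplit("/", 1)[-1]
        commands.append(f"pvcreate {pv}")
        commands.append(f"vgcreate {vg} {pv}")
        for i in range(osd_lvm_count):
            lv = f"osd-data-{i}"
            commands.append(f"lvcreate -l {percent}%VG -n {lv} {vg}")
            devices.append(f"{vg}/{lv}")
    return {"commands": commands, "devices": devices}


plan = plan_lvs(["/dev/nvme0n1", "/dev/nvme2n1"], osd_lvm_count=2)
for command in plan["commands"]:
    print(command)
print(plan["devices"])
```

With the two PVs and osd_lvm_count of 2 from the example, this yields four LVs, two per NVMe device.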

Comment 13 John Fulton 2018-04-09 13:26:07 UTC

Proposal:

1. update docs so user manually does what is in comment #12
2. if some future version of ceph-ansible gets the feature from comment #12, then update docs to use feature instead

We might be able to provide a preboot script along with the docs update which would set up the LVs during step 1 (we used to use a script like that to clean the disks during deployment; Ironic cleans them now).

What do you think of comment #12, Seb?

Comment 14 John Fulton 2018-04-09 13:36:45 UTC

(In reply to leseb from comment #9)
> I second what Alfredo is saying, if this is something supported out of the
> box by ceph-volume then there is no need for an RFE and this should be
> closed. We just need to make sure osp/ooo adds the support for ceph-volume.

Yes, osp/ooo can ship the appropriate ceph versions which include ceph-volume. However, as per Ben's comment #11, shouldn't ceph-ansible set up the LVs?

Suppose OSP were not in the picture and you have customers deploying Ceph in a new environment. Are you going to require as a prerequisite that they have the logical volumes already created on their systems? If so, they might configure those LVs inconsistently across their systems or not configure them optimally (e.g. too many LVs per PV). These are the types of problems which lead to the need for deployment tools which ensure it's done correctly in every deployment. For this reason, I am asking if what's in comment #12, or something like it, could be a future feature of ceph-ansible.

Comment 15 Sébastien Han 2018-04-19 09:15:46 UTC

Work on this is in progress and will be addressed by the introduction of choose_disk plus the ability to create the required PV/VG/LV.

The subject of the PR has changed a bit, since we are now talking more about a set of pre-tasks that will create the PV/VG/LV.

Comment 18 Christina Meno 2018-06-04 15:47:56 UTC

Harish, it's in 3.*. Would you please say where you are proposing I put it?

Comment 19 Harish NV Rao 2018-06-05 06:57:43 UTC

3.2, if it's going to be fixed there.

Comment 23 Randy Martinez 2018-06-12 01:21:01 UTC

John,

I don't think this is the case anymore. I've validated ceph-disk can in fact support non-collocated scenario w/NVMEs. Update the osds.yml to reflect the following:

osd_scenario: non-collocated
devices:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdd
dedicated_devices:
  - /dev/nvme0n1
  - /dev/nvme0n1
  - /dev/nvme0n1

New partitions on nvme0n1 will be added automatically in line with journal_size configured.

Comment 25 Federico Lucifredi 2018-06-18 13:49:55 UTC

Outcome of triage is that this is a difficult objective for 3.0Z5.

Comment 26 Mike Hackett 2018-06-18 14:15:43 UTC

*** Bug 1588085 has been marked as a duplicate of this bug. ***

Comment 27 John Fulton 2018-07-17 12:39:37 UTC

(In reply to Randy Martinez from comment #23)
> John,
> 
> I don't think this is the case anymore. I've validated ceph-disk can in fact
> support non-collocated scenario w/NVMEs. Update the osds.yml to reflect the
> following:
> 
> osd_scenario: non-collocated
> devices:
>   - /dev/sdb
>   - /dev/sdc
>   - /dev/sdd
> dedicated_devices:
>   - /dev/nvme0n1
>   - /dev/nvme0n1
>   - /dev/nvme0n1
> 
> New partitions on nvme0n1 will be added automatically in line with
> journal_size configured.

Hi Randy,

That works but it's for something else. That's not what this bug is about. This bug is about passing a PV like /dev/nvme0n1 and a number, e.g. 4, and then having ceph-ansible create 4 LVs on that PV and then using those LVs as devices as if I had created them myself and then passed this:

devices:
  - lv1
  - lv2
  - lv3
  - lv4

For info on why you would do this see http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

Comment 31 Ken Dreyer (Red Hat) 2018-09-28 18:24:03 UTC

What other changes must happen for ceph-ansible to resolve this BZ beyond the ceph-ansible v3.2.0beta2 release?

Comment 32 Sébastien Han 2018-10-01 09:56:28 UTC

None, but I'll let Andrew confirm since he recently added the support for "batch" in ceph-ansible. Thanks.

Comment 37 Sébastien Han 2018-10-19 08:11:35 UTC

Thanks Bara, the only thing to mention is that this is not supported by the containerized deployment. Thanks.

Comment 39 Sébastien Han 2018-10-22 14:41:05 UTC

I'm actually assigning this to Andrew since he did the implementation and I'll let him answer your question as well John.
Thanks.

Comment 40 Ben England 2018-10-22 17:59:08 UTC

I don't follow why containerized Ceph should be different, other than that it's site-docker.yml vs site.yml.

Comment 41 Sébastien Han 2018-10-23 08:42:15 UTC

Ben, it is different because batch does not support prepare only, see: http://tracker.ceph.com/issues/36363

Comment 43 Sébastien Han 2018-10-24 12:47:38 UTC

John, please do not forget to mention that this does not support containerized deployments. Thanks

Comment 49 John Fulton 2018-10-26 12:12:45 UTC

(In reply to leseb from comment #41)
> Ben, it is different because batch does not support prepare only, see:
> http://tracker.ceph.com/issues/36363

So support for this feature in containers depends on the above issue being completed.

As far as you know, when it is completed will that be sufficient and this feature will be supported with containers?

If so will that be tracked in a different bug?

  John

Comment 50 Andrew Schoen 2018-10-26 12:40:22 UTC

(In reply to John Fulton from comment #49)
> (In reply to leseb from comment #41)
> > Ben, it is different because batch does not support prepare only, see:
> > http://tracker.ceph.com/issues/36363
> 
> So support for this feature in containers depends on the above issue being
> completed.
> 
> As far as you know, when it is completed will that be sufficient and this
> feature will be supported with containers?
> 
> If so will that be tracked in a different bug?
> 
>   John

The 'ceph-volume lvm batch --prepare' feature is completed upstream, merged to master, and currently being backported to luminous. Once it's merged to luminous we'll get it cherry-picked downstream.

Sébastien, when are you able to start on the container support for this?
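For illustration, here is a minimal sketch of how the command line discussed in this bug might be assembled. Only the `lvm batch` subcommand and the `--osds-per-device` and `--prepare` flags come from the linked pull requests; the helper name `batch_command` and the example device paths are assumptions, and the command is only built, not run.

```python
from typing import List


def batch_command(devices: List[str], osds_per_device: int = 1,
                  prepare_only: bool = False) -> List[str]:
    """Build (but do not run) a 'ceph-volume lvm batch' argument list.

    osds_per_device defaults to 1, matching the default named in the
    upstream pull request title. The helper itself is hypothetical.
    """
    if not devices:
        raise ValueError("at least one device is required")
    if osds_per_device < 1:
        raise ValueError("osds_per_device must be at least 1")

    argv = ["ceph-volume", "lvm", "batch",
            "--osds-per-device", str(osds_per_device)]
    if prepare_only:
        # Prepare the OSDs without activating them.
        argv.append("--prepare")
    argv.extend(devices)
    return argv


print(" ".join(batch_command(["/dev/nvme0n1", "/dev/nvme2n1"],
                             osds_per_device=2, prepare_only=True)))
```

Returning the arguments as a list rather than a single string keeps the sketch safe to pass to a process runner without shell quoting concerns.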

Comment 51 Alfredo Deza 2018-10-26 13:42:13 UTC

Commits from https://github.com/ceph/ceph/pull/24587 have been pushed downstream

Comment 54 Sébastien Han 2018-10-26 14:33:11 UTC

Andrew, the patch is upstream, I just added it to the BZ.

Comment 56 Ken Dreyer (Red Hat) 2018-10-31 17:19:07 UTC

Seb added https://github.com/ceph/ceph-ansible/pull/3269 to this BZ, so I'm resetting Fixed In Version to ceph-ansible 3.2.0rc1.

Comment 60 Vasishta 2018-12-05 05:40:27 UTC

All planned test cases have been completed successfully; moving BZ to VERIFIED state.

Regards,
Vasishta Shastry
QE, Ceph

Comment 62 errata-xmlrpc 2019-01-03 19:01:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0020

acalhoun
adeza
agunn
anharris
aschoen
bengland
ceph-eng-bugs
ceph-qe-bugs
ddharwar
flucifre
gfidente
gmeno
hnallurv
jbrier
johfulto
kdreyer
mhackett
nlevine
nthomas
pasik
r.martinez
sankarshan
shan
tserlin
vashastr
vumrao