Bug 1591074 - Support NVMe based bucket index pools
Summary: Support NVMe based bucket index pools
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 3.1
Assignee: Ali Maredia
QA Contact: Tiffany Nguyen
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On: 1593868 1602919
Blocks: 1581350 1584264
 
Reported: 2018-06-14 02:22 UTC by John Harrigan
Modified: 2018-09-26 18:23 UTC
CC List: 19 users

Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc18.el7cp Ubuntu: ceph-ansible_3.1.0~rc18-2redhat1
Doc Type: Enhancement
Doc Text:
.Support NVMe based bucket index pools
Previously, configuring Ceph to optimize storage on high speed NVMe or SATA SSDs when using Object Gateway was a completely manual process which required complicated LVM configuration. With this release, the `ceph-ansible` package provides two new Ansible playbooks that facilitate setting up SSD storage using LVM to optimize performance when using Object Gateway. See the link:{object-gw-production}#using-nvme-with-lvm-optimally[Using NVMe with LVM Optimally] chapter in the {product} Object Gateway for Production Guide for more information.
Clone Of:
Environment:
Last Closed: 2018-09-26 18:22:08 UTC
Embargoed:


Attachments
sample osds.yml file showing osd_scenario=lvm format used in Scale Lab (13.11 KB, text/x-vhdl)
2018-07-27 14:27 UTC, John Harrigan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 2922 0 None closed infrastructure-playbooks: playbooks for BI partitioning on NVMe 2021-02-12 11:02:37 UTC
Red Hat Bugzilla 1593868 0 high CLOSED Best practices updates for bucket index to be on same nvme as journal 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2018:2819 0 None None None 2018-09-26 18:23:34 UTC

Internal Links: 1593868

Description John Harrigan 2018-06-14 02:22:06 UTC
Description of problem:
Placing bucket index pools on NVMe devices is recommended by Engineering and GSS for Ceph RGW usage. ceph-ansible should be extended, and documentation provided, to help customers accomplish this.

There are gaps in the ceph-volume automation: the user needs to manually partition devices, create the LVM hierarchy, and then pass that layout to the ceph-ansible 'group_vars/osds.yml' file for deployment. This is not documented in the RHCS manuals, requires many steps, and is error prone.

In my cluster I developed a new Ansible playbook that prepares the storage devices (partitions and LVM configuration). Then I had to manually edit the osds.yml file with the details for osd_scenario=lvm.
The tooling I used can be found here:
https://github.com/jharriga/BIprovision
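
For context, here is a minimal sketch of the manual procedure this asks to automate, on one OSD node with one NVMe device and four HDDs. The VG/LV names and journal sizes are placeholders, and the osds.yml fragment assumes the standard ceph-ansible osd_scenario=lvm / lvm_volumes format for filestore (only two of the ten LVs are shown):

# Carve the NVMe device into one journal LV per HDD OSD, plus a
# journal LV and a data LV for the bucket index OSD (names/sizes illustrative).
pvcreate /dev/nvme0n1
vgcreate ceph-nvme /dev/nvme0n1
for d in sdb sdc sdd sde; do
    lvcreate -L 10G -n journal-${d} ceph-nvme
done
lvcreate -L 10G -n journal-bucket-index ceph-nvme
lvcreate -l 80%FREE -n data-bucket-index ceph-nvme

# One data LV per HDD, on the HDD itself.
for d in sdb sdc sdd sde; do
    pvcreate /dev/${d}
    vgcreate ceph-${d} /dev/${d}
    lvcreate -l 100%FREE -n data-${d} ceph-${d}
done

# Then describe the LVs to ceph-ansible in group_vars/osds.yml, e.g.:
cat >> group_vars/osds.yml <<'EOF'
osd_objectstore: filestore
osd_scenario: lvm
lvm_volumes:
  - data: data-sdb
    data_vg: ceph-sdb
    journal: journal-sdb
    journal_vg: ceph-nvme
  - data: data-bucket-index
    data_vg: ceph-nvme
    journal: journal-bucket-index
    journal_vg: ceph-nvme
EOF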

Comment 4 Sébastien Han 2018-07-02 12:42:01 UTC
Yes, this should move to 3.2.
I'm re-targeting the work.

Thanks.

Comment 8 John Harrigan 2018-07-27 14:27:25 UTC
Created attachment 1471129 [details]
sample osds.yml file showing osd_scenario=lvm format used in Scale Lab

Comment 10 John Harrigan 2018-08-02 11:08:29 UTC
For QE testing I suggest the following:

Two hardware configurations:
  * 1 NVMe and (at least) four HDDs - one bucketIndex
  * 2 NVMe and (at least) four HDDs - two bucketIndexes

WORKFLOW:
1) Starting with these available raw block devices:
  * /dev/nvme0n1
  * /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde
2) Make edits to the playbook to match the configuration
3) Run the playbook
4) Review the LVM configuration. Ten LVs total:
  * one FSjournal LV per HDD (placed on NVMe) - four LVs on /dev/nvme0n1
  * one data LV per HDD (placed on each HDD) - one LV per HDD
  * one FSjournal LV for the BucketIndex (placed on NVMe) - one LV on /dev/nvme0n1
  * one data LV for the BucketIndex (placed on NVMe) - one LV on /dev/nvme0n1
5) Edit osds.yml file for "osd_scenario=lvm"
6) run ceph-ansible and verify successful deployment
7) purge-cluster
8) Run the playbook for teardown
9) Verify the LVM configuration has been removed (lvdisplay, pvdisplay)

Redo for the second configuration, this time using two NVMe devices.
  * /dev/nvme0n1, /dev/nvme1n1
  * /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde
Make edits to the playbook for the first NVMe device.
Run the playbook.
Make edits to the playbook for the second NVMe device.
Run the playbook.

The FSjournal and BucketIndex LVs should be split across the two NVMe devices. There should be two BucketIndex OSDs (one per NVMe device) for a total of twelve LVs. The LVM configuration should look like this:
  * one FSjournal LV per HDD (placed on both NVMe devices) - two LVs on /dev/nvme0n1, two on /dev/nvme1n1
  * one data LV per HDD (placed on each HDD) - one LV per HDD
  * one FSjournal LV per BucketIndex (placed on both NVMe devices) - one LV on /dev/nvme0n1, one on /dev/nvme1n1
  * one data LV per BucketIndex (placed on both NVMe devices) - one LV on /dev/nvme0n1, one on /dev/nvme1n1
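
On the single-NVMe configuration, the workflow above might translate into commands roughly like the following. The lv-create.yml name comes from the linked pull request and comment 25; the lv-teardown.yml name, the infrastructure-playbooks/ paths, and the /usr/share/ceph-ansible working directory are assumptions to confirm against the shipped ceph-ansible package.

cd /usr/share/ceph-ansible

# Steps 2-3: edit the LV provisioning playbook (or its vars) for this
# host's devices, then run it (playbook path assumed).
ansible-playbook infrastructure-playbooks/lv-create.yml

# Step 4: review the LVM layout - expect ten LVs for 1 NVMe + 4 HDDs.
lvs -o lv_name,vg_name,devices

# Steps 5-6: describe the LVs in group_vars/osds.yml (osd_scenario: lvm),
# then deploy and verify the cluster.
ansible-playbook site.yml

# Steps 7-9: purge the cluster, tear down the LVs, confirm they are gone.
ansible-playbook infrastructure-playbooks/purge-cluster.yml
ansible-playbook infrastructure-playbooks/lv-teardown.yml
lvdisplay
pvdisplay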

Comment 11 Ben England 2018-08-02 14:02:44 UTC
To add to John's comment, I know that teardown *seems* like an insignificant thing, because the purpose of ceph-ansible is to set up a cluster, not tear it down. After all, Ansible playbooks are supposed to be idempotent (doing it twice is the same as doing it once).

However, in practice, we find that if you don't have tear-down capabilities (i.e. infrastructure-playbooks/*purge-cluster*), then if setup fails, or if the wrong configuration was established, you often have no way to undo the damage. In a CI virtualized environment, it's not necessary; you just create new VMs and new virtual drives and start over. But in the bare-metal world of real hardware, you can't do that. We could all write our own scripts to re-init storage, but that's exactly what we're trying to avoid; we don't want every ceph-ansible user to write their own automation to do tear-down, because it's really hard to do right.

Ansible is memory-less - it has no innate way of knowing what configuration was previously used, so it cannot and does not know how to unwind any previous configuration before establishing a new configuration.  But with a teardown script, you can purge the old configuration, then change the inputs to ceph-ansible, then run site.yml to establish the new configuration.

So please have pity on us poor souls who live outside the sunny sim-world of CI ;-)

Comment 20 John Harrigan 2018-08-21 17:48:42 UTC
It's worth noting that the current playbook only addresses filestore-based clusters, since it creates LVs for FSjournals and bucket indexes.

In RHCS 3.2 bluestore will be supported and likely the default.
How will the logical volumes be created in that release?
Will this playbook need to be extended to support bluestore, which requires two LVs (WAL and DB) as well as bucket index on NVMe ?

Comment 21 Josh Durgin 2018-08-21 18:28:51 UTC
(In reply to John Harrigan from comment #20)
> it's worth noting that current playbook only addresses filestore based
> clusters
> since it creates LVs for FSjournals and bucket indexes.
> 
> In RHCS 3.2 bluestore will be supported and likely the default.
> How will the logical volumes be created in that release?
> Will this playbook need to be extended to support bluestore, which requires
> two LVs (WAL and DB) as well as bucket index on NVMe ?

If we need changes for 3.2 + bluestore, let's open a new bz for them. To clarify, only one LV is needed in the common case of one fast device and one slow device - if you have a DB LV, the WAL will be stored there.
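
For illustration of that point (not part of this playbook work): with ceph-volume, a BlueStore OSD that keeps its data on an HDD and its DB on NVMe needs only the two LVs below; omitting --block.wal keeps the WAL co-located on the DB LV. The VG/LV names here are placeholders.

# Placeholder VG/LV names; with no --block.wal the WAL lives on the DB LV.
ceph-volume lvm create --bluestore \
    --data ceph-sdb/data-sdb \
    --block.db ceph-nvme/db-sdb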

Comment 22 Sébastien Han 2018-08-21 18:59:15 UTC
(In reply to John Harrigan from comment #20)
> it's worth noting that current playbook only addresses filestore based
> clusters
> since it creates LVs for FSjournals and bucket indexes.
> 
> In RHCS 3.2 bluestore will be supported and likely the default.
> How will the logical volumes be created in that release?
> Will this playbook need to be extended to support bluestore, which requires
> two LVs (WAL and DB) as well as bucket index on NVMe ?

The plan is to have ceph-volume handle the LV creation for 3.2, so this won't need to be extended. That said, this needs a BZ.

Comment 23 John Harrigan 2018-08-21 19:28:32 UTC
(In reply to leseb from comment #22)
> (In reply to John Harrigan from comment #20)
> > it's worth noting that current playbook only addresses filestore based
> > clusters
> > since it creates LVs for FSjournals and bucket indexes.
> > 
> > In RHCS 3.2 bluestore will be supported and likely the default.
> > How will the logical volumes be created in that release?
> > Will this playbook need to be extended to support bluestore, which requires
> > two LVs (WAL and DB) as well as bucket index on NVMe ?
> 
> The plan is to have ceph-volume handling the LV's creation for 3.2 so this
> won't need to be extended. Although, this needs a BZ.

opened new BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1619812

Comment 25 Tiffany Nguyen 2018-08-27 16:22:46 UTC
Verified using build 12.2.5-39.el7cp.
Both scenarios from comment #10 were used for verification, with the two hardware configurations:
  * 1 NVMe and (at least) four HDDs - one bucketIndex
  * 2 NVMe and (at least) four HDDs - two bucketIndexes

Saw a "device excluded by a filter" error while running "ansible-playbook lv-create.yml". Workaround: run "wipefs -a" on all devices on the OSD nodes to remove any FS/GPT signatures (a command sketch follows below). This needs to be addressed.

Other than that, everything is working as expected.
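
A minimal sketch of that workaround, assuming the device list from comment #10; wipefs -a is destructive, so it is only appropriate on nodes being (re)provisioned:

# On each OSD node: clear stale FS/GPT signatures so LVM does not
# reject the devices with "device excluded by a filter".
for dev in /dev/nvme0n1 /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    wipefs -a "${dev}"
done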

Comment 26 Vasu Kulkarni 2018-08-28 14:21:25 UTC
I think it's related to BZ 1619090.

Comment 35 John Harrigan 2018-09-06 13:17:11 UTC
These doc updates should mention that they only apply to filestore, not bluestore.

Chapter 10. Using NVMe with LVM Optimally

Comment 38 errata-xmlrpc 2018-09-26 18:22:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819

