Description of problem: Engineering and GSS recommend placing bucket index pools on NVMe devices for Ceph RGW usage. ceph-ansible should be extended, and documentation provided, to help customers accomplish this. Today there are gaps in the ceph-volume automation: the user must manually partition the devices, create the LVM hierarchy, and then pass the result to ceph-ansible via the 'group_vars/osds.yml' file for deployment. This is not documented in the RHCS manuals, requires many steps, and is error prone. In my cluster I developed a new ansible playbook which prepared the storage devices (partitions and LVM configuration). I then had to manually edit the osds.yml file with the details for osd_scenario=lvm. The tooling I used can be found here: https://github.com/jharriga/BIprovision
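As a rough illustration only (VG/LV names, the journal size, and device paths are assumptions and are not taken from the BIprovision repo), the LVM preparation step can be expressed with the stock lvg/lvol Ansible modules:

  - hosts: osds
    become: true
    tasks:
      - name: create a VG on the NVMe for FS journals and the bucket-index OSD
        lvg:
          vg: vg_nvme0
          pvs: /dev/nvme0n1

      - name: create one FS journal LV per HDD-backed OSD on the NVMe VG
        lvol:
          vg: vg_nvme0
          lv: "lv_journal_{{ item }}"
          size: 5g
        loop: [ sdb, sdc, sdd, sde ]

      - name: create a data VG on each HDD
        lvg:
          vg: "vg_{{ item }}"
          pvs: "/dev/{{ item }}"
        loop: [ sdb, sdc, sdd, sde ]

      - name: carve the full HDD into a data LV
        lvol:
          vg: "vg_{{ item }}"
          lv: "lv_data_{{ item }}"
          size: 100%FREE
        loop: [ sdb, sdc, sdd, sde ]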
Yes, this should move to 3.2. I'm re-targeting the work. Thanks.
Created attachment 1471129 [details] sample osds.yml file showing osd_scenario=lvm format used in Scale Lab
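For readers without access to the attachment, a minimal sketch of that osd_scenario=lvm format (the LV and VG names here are assumptions, not copied from the attachment):

  osd_objectstore: filestore
  osd_scenario: lvm
  lvm_volumes:
    - data: lv_data_sdb          # data LV carved from /dev/sdb
      data_vg: vg_sdb
      journal: lv_journal_sdb    # journal LV placed on the NVMe VG
      journal_vg: vg_nvme0
    - data: lv_bucketindex       # bucket-index OSD: data and journal both on NVMe
      data_vg: vg_nvme0
      journal: lv_journal_bi
      journal_vg: vg_nvme0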
For QE testing I suggest the following two hardware configurations:
* 1 NVMe and (at least) four HDDs - one bucket index
* 2 NVMe and (at least) four HDDs - two bucket indexes

WORKFLOW:
1) Start with these available raw block devices:
   * /dev/nvme0n1
   * /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde
2) Edit the playbook to match the configuration
3) Run the playbook
4) Review the LVM configuration (see the verification sketch after this list). Ten LVs total:
   * one FSjournal LV per HDD (placed on NVMe) - 4 LVs on /dev/nvme0n1
   * one data LV per HDD (placed on each HDD) - one LV per HDD
   * one FSjournal LV for the bucket index (placed on NVMe) - one LV on /dev/nvme0n1
   * one data LV for the bucket index (placed on NVMe) - one LV on /dev/nvme0n1
5) Edit the osds.yml file for "osd_scenario=lvm"
6) Run ceph-ansible and verify successful deployment
7) Run purge-cluster
8) Run the playbook for teardown
9) Verify the LVM configuration is removed (lvdisplay, pvdisplay)

Redo for the second configuration, this time using two NVMe devices:
* /dev/nvme0n1, /dev/nvme1n1
* /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde
Edit the playbook for the first NVMe device and run the playbook, then edit the playbook for the second NVMe device and run the playbook again. The FSjournal and bucket-index LVs should be split across the two NVMes. There should be two bucket-index OSDs (one per NVMe device) for a total of twelve LVs. The LVM configuration should look like this:
* one FSjournal LV per HDD (placed on both NVMes) - two LVs on /dev/nvme0n1, two on /dev/nvme1n1
* one data LV per HDD (placed on each HDD) - one LV per HDD
* one FSjournal LV per bucket index (placed on both NVMes) - one LV on /dev/nvme0n1, one on /dev/nvme1n1
* one data LV per bucket index (placed on both NVMes) - one LV on /dev/nvme0n1, one on /dev/nvme1n1
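For steps 4 and 9 above, a small hedged sketch (the 'osds' host group name is an assumption) that dumps each node's LV layout for manual review:

  - hosts: osds
    become: true
    tasks:
      - name: report every LV with its VG and backing physical devices
        command: lvs -o lv_name,vg_name,devices --noheadings
        register: lv_report
        changed_when: false

      - name: show the report so the expected 10 (or 12) LVs can be checked
        debug:
          var: lv_report.stdout_lines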
To add to John's comment, I know that teardown *seems* like an insignificant thing, because the purpose of ceph-ansible is to set up a cluster, not tear it down. After all, ansible playbooks are supposed to be idempotent (doing it twice is the same as doing it once). However, in practice, we find that if you don't have tear-down capabilities (i.e. infrastructure-playbooks/*purge-cluster*), then if setup fails, or if the wrong configuration was established, you often have no way to undo the damage. In a CI virtualized environment it's not necessary: you just create new VMs and new virtual drives and start over. But in the bare-metal world of real hardware, you can't do that. We could all write our own scripts to re-init storage, but that's exactly what we're trying to avoid; we don't want every ceph-ansible user to write their own automation to do tear-down, because it's really hard to do right. Ansible is memory-less - it has no innate way of knowing what configuration was previously used, so it cannot and does not know how to unwind any previous configuration before establishing a new one. But with a teardown script, you can purge the old configuration, then change the inputs to ceph-ansible, then run site.yml to establish the new configuration. So please have pity on us poor souls who live outside the sunny sim-world of CI ;-)
In https://github.com/ceph/ceph-ansible/releases/tag/v3.1.0rc18
it's worth noting that the current playbook only addresses filestore-based clusters, since it creates LVs for FSjournals and bucket indexes. In RHCS 3.2 bluestore will be supported and will likely be the default. How will the logical volumes be created in that release? Will this playbook need to be extended to support bluestore, which requires two LVs (WAL and DB) as well as the bucket index on NVMe?
(In reply to John Harrigan from comment #20) > it's worth noting that current playbook only addresses filestore based > clusters > since it creates LVs for FSjournals and bucket indexes. > > In RHCS 3.2 bluestore will be supported and likely the default. > How will the logical volumes be created in that release? > Will this playbook need to be extended to support bluestore, which requires > two LVs (WAL and DB) as well as bucket index on NVMe ? If we need changes for 3.2 + bluestore, let's open a new bz for them. To clarify, only one LV is needed in the common case of one fast device and one slow device - if you have a DB LV, the WAL will be stored there.
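To illustrate that point, a hedged sketch of what a bluestore entry in osds.yml could look like with only a DB LV on the NVMe (LV/VG names are assumptions; the WAL simply lives inside the DB LV when no separate wal/wal_vg is given):

  osd_objectstore: bluestore
  osd_scenario: lvm
  lvm_volumes:
    - data: lv_data_sdb        # bluestore data LV on the HDD
      data_vg: vg_sdb
      db: lv_db_sdb            # RocksDB (and, implicitly, the WAL) on the NVMe
      db_vg: vg_nvme0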
(In reply to John Harrigan from comment #20) > it's worth noting that current playbook only addresses filestore based > clusters > since it creates LVs for FSjournals and bucket indexes. > > In RHCS 3.2 bluestore will be supported and likely the default. > How will the logical volumes be created in that release? > Will this playbook need to be extended to support bluestore, which requires > two LVs (WAL and DB) as well as bucket index on NVMe ? The plan is to have ceph-volume handling the LV's creation for 3.2 so this won't need to be extended. Although, this needs a BZ.
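To sketch what "ceph-volume handling the LV creation" would mean for the osds.yml shown earlier (device paths reused from the QE workflow above; the automatic DB placement comes from ceph-volume's rotational vs. non-rotational split, so treat this as an assumption about the 3.2 behavior rather than a confirmed recipe):

  osd_objectstore: bluestore
  osd_scenario: lvm
  devices:              # ceph-volume "lvm batch" creates the LVs itself,
    - /dev/sdb          # placing block.db for the HDD OSDs on the NVMe
    - /dev/sdc
    - /dev/sdd
    - /dev/sde
    - /dev/nvme0n1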
(In reply to leseb from comment #22) > (In reply to John Harrigan from comment #20) > > it's worth noting that current playbook only addresses filestore based > > clusters > > since it creates LVs for FSjournals and bucket indexes. > > > > In RHCS 3.2 bluestore will be supported and likely the default. > > How will the logical volumes be created in that release? > > Will this playbook need to be extended to support bluestore, which requires > > two LVs (WAL and DB) as well as bucket index on NVMe ? > > The plan is to have ceph-volume handling the LV's creation for 3.2 so this > won't need to be extended. Although, this needs a BZ. opened new BZ https://bugzilla.redhat.com/show_bug.cgi?id=1619812
Verified using build 12.2.5-39.el7cp. Both scenarios from comment #10 were used for verification. Two hardware configurations:
* 1 NVMe and (at least) four HDDs - one bucket index
* 2 NVMe and (at least) four HDDs - two bucket indexes
Saw the issue "device excluded by a filter" while running "ansible-playbook lv-create.yml". Workaround: run "wipefs -a" on all devices on the OSD nodes to remove any FS/GPT signatures. This needs to be addressed. Other than that, everything is working as expected.
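A hedged sketch of how that workaround could be automated (the host group and device list are assumptions; the comment above only describes running "wipefs -a" by hand):

  - hosts: osds
    become: true
    vars:
      wipe_devices:        # match the raw devices fed to lv-create.yml
        - /dev/nvme0n1
        - /dev/sdb
        - /dev/sdc
        - /dev/sdd
        - /dev/sde
    tasks:
      - name: clear stale FS/GPT signatures so LVM filters do not exclude the device
        command: "wipefs -a {{ item }}"
        loop: "{{ wipe_devices }}"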
I think it's related to BZ 1619090
These doc updates (Chapter 10, "Using NVMe with LVM Optimally") should mention that they apply only to filestore, not bluestore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819