Bug 1650306
Summary: unable to use ceph-volume lvm batch on OSD systems w/HDDs and multiple NVMe devices

Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Volume
Version: 3.2
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Reporter: John Harrigan <jharriga>
Assignee: Alfredo Deza <adeza>
QA Contact: Tiffany Nguyen <tunguyen>
CC: adeza, agunn, aschoen, bengland, ceph-eng-bugs, ceph-qe-bugs, dfuller, gmeno, hnallurv, jbrier, jharriga, kdreyer, mhackett, pasik, seb, shan, tserlin, tunguyen, vakulkar, vashastr
Fixed In Version: RHEL: ceph-ansible-3.2.0-0.1.rc5.el7cp; Ubuntu: ceph-ansible_3.2.0~rc5-2redhat1
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-01-03 19:02:22 UTC
Bug Blocks: 1641792
@John, the message:

> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering

means that ceph-volume looked at the devices and determined that the strategy would be to place data on the spinning drives and block.db on the NVMe devices. However, *filtering* happened, which I assume removed the NVMe devices, leaving just the spinning ones. This change is detected and ceph-volume correctly refuses to continue, since the end result would be to ignore the NVMe devices and fully consume the spinning drives.

There are a few reasons why the NVMe devices might have been excluded. Could you run the report command to get more information? For example:

> ceph-volume --cluster ceph lvm batch --bluestore --yes /dev/sdc /dev/sdd /dev/nvme0n1 /dev/sdq /dev/sdr /dev/nvme1n1 --report

And, in case the JSON report is richer:

> ceph-volume --cluster ceph lvm batch --bluestore --yes /dev/sdc /dev/sdd /dev/nvme0n1 /dev/sdq /dev/sdr /dev/nvme1n1 --report --format=json

Created attachment 1506227 [details]
text 'ceph lvm batch' report
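(Note: one common reason for a device to be filtered out of a batch run is a leftover partition table or filesystem/LVM signature from a previous deployment. A quick, read-only way to check the NVMe devices on an OSD node -- device names taken from this report; the exact filtering criteria of this ceph-volume release are not spelled out in the thread:

# lsblk /dev/nvme0n1 /dev/nvme1n1
# blkid /dev/nvme0n1 /dev/nvme1n1
# wipefs /dev/nvme0n1        <-- with no options, wipefs only lists signatures, it does not erase anything
)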
(In reply to John Harrigan from comment #4)
> Created attachment 1506227 [details]
> text 'ceph lvm batch' report

Both reports failed to provide any information:

$ cat report.txt
--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering
$ cat report.json
--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering

John, that sounds like something we need to improve, thanks for catching that. Can you add the /var/log/ceph/ceph-volume.log file to this ticket? The reasons for filtering the devices should all be logged there.

The ceph-ansible run leaves this cluster all HDD based. How should I specify the configuration in osds.yml to get the intended configuration, where ceph-ansible deploys a bluestore cluster with two HDDs paired to each of the two NVMe devices?

# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
 -1       87.30176 root default
-17        7.27515     host c05-h29-6048r
  7   hdd  1.81879         osd.7              up  1.00000 1.00000
 19   hdd  1.81879         osd.19             up  1.00000 1.00000
 31   hdd  1.81879         osd.31             up  1.00000 1.00000
 43   hdd  1.81879         osd.43             up  1.00000 1.00000
-21        7.27515     host c06-h01-6048r
  8   hdd  1.81879         osd.8              up  1.00000 1.00000
 20   hdd  1.81879         osd.20             up  1.00000 1.00000
 32   hdd  1.81879         osd.32             up  1.00000 1.00000
 45   hdd  1.81879         osd.45             up  1.00000 1.00000
-19        7.27515     host c06-h05-6048r
  9   hdd  1.81879         osd.9              up  1.00000 1.00000
 21   hdd  1.81879         osd.21             up  1.00000 1.00000
 33   hdd  1.81879         osd.33             up  1.00000 1.00000
 46   hdd  1.81879         osd.46             up  1.00000 1.00000
-25        7.27515     host c06-h09-6048r
 10   hdd  1.81879         osd.10             up  1.00000 1.00000
 22   hdd  1.81879         osd.22             up  1.00000 1.00000
 34   hdd  1.81879         osd.34             up  1.00000 1.00000
 44   hdd  1.81879         osd.44             up  1.00000 1.00000
-23        7.27515     host c06-h13-6048r
 11   hdd  1.81879         osd.11             up  1.00000 1.00000
 23   hdd  1.81879         osd.23             up  1.00000 1.00000
 35   hdd  1.81879         osd.35             up  1.00000 1.00000
 47   hdd  1.81879         osd.47             up  1.00000 1.00000
 -3        7.27515     host c07-h01-6048r
  0   hdd  1.81879         osd.0              up  1.00000 1.00000
 12   hdd  1.81879         osd.12             up  1.00000 1.00000
 26   hdd  1.81879         osd.26             up  1.00000 1.00000
 37   hdd  1.81879         osd.37             up  1.00000 1.00000
 -7        7.27515     host c07-h05-6048r
  3   hdd  1.81879         osd.3              up  1.00000 1.00000
 13   hdd  1.81879         osd.13             up  1.00000 1.00000
 25   hdd  1.81879         osd.25             up  1.00000 1.00000
 38   hdd  1.81879         osd.38             up  1.00000 1.00000
 -5        7.27515     host c07-h09-6048r
  1   hdd  1.81879         osd.1              up  1.00000 1.00000
 14   hdd  1.81879         osd.14             up  1.00000 1.00000
 24   hdd  1.81879         osd.24             up  1.00000 1.00000
 36   hdd  1.81879         osd.36             up  1.00000 1.00000
-15        7.27515     host c07-h13-6048r
  2   hdd  1.81879         osd.2              up  1.00000 1.00000
 15   hdd  1.81879         osd.15             up  1.00000 1.00000
 30   hdd  1.81879         osd.30             up  1.00000 1.00000
 40   hdd  1.81879         osd.40             up  1.00000 1.00000
-11        7.27515     host c07-h17-6048r
  4   hdd  1.81879         osd.4              up  1.00000 1.00000
 16   hdd  1.81879         osd.16             up  1.00000 1.00000
 27   hdd  1.81879         osd.27             up  1.00000 1.00000
 41   hdd  1.81879         osd.41             up  1.00000 1.00000
 -9        7.27515     host c07-h21-6048r
  5   hdd  1.81879         osd.5              up  1.00000 1.00000
 17   hdd  1.81879         osd.17             up  1.00000 1.00000
 28   hdd  1.81879         osd.28             up  1.00000 1.00000
 42   hdd  1.81879         osd.42             up  1.00000 1.00000
-13        7.27515     host c07-h25-6048r
  6   hdd  1.81879         osd.6              up  1.00000 1.00000
 18   hdd  1.81879         osd.18             up  1.00000 1.00000
 29   hdd  1.81879         osd.29             up  1.00000 1.00000
 39   hdd  1.81879         osd.39             up  1.00000 1.00000

Created attachment 1506268 [details]
ceph volume log
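(Note: until the log is reviewed, the filtering reasons can usually be pulled straight out of it on the OSD node. The exact wording of the log lines may vary by release, so treat the pattern below as a starting point:

# grep -iE 'filter|strategy' /var/log/ceph/ceph-volume.log
)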
What configuration are you using so that ceph-ansible deploys only to the HDDs?

I extended the /root/wipefs_6048r.sh script to include this:

dd if=/dev/zero of=/dev/$device bs=1M count=1

Then ran this sequence of cmds and got the same result:

# ansible-playbook purge-cluster.yml
# ansible osds -m script -a "/root/wipefs_6048.sh"
# ansible-playbook site.yml 2>&1 | tee -a Deploy2nvme.Nov15zap

TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] ***
Friday 16 November 2018 00:02:14 +0000 (0:00:02.430)       0:34:58.752 *******
fatal: [c07-h01-6048r]: FAILED! => {"changed": true, "cmd": ["ceph-volume", "--cluster", "ceph", "lvm", "batch", "--bluestore", "--yes", "/dev/sdc", "/dev/sdd", "/dev/nvme0n1", "/dev/sdq", "/dev/sdr", "/dev/nvme1n1", "--report", "--format=json"], "msg": "non-zero return code", "rc": 1, "stderr": "", "stderr_lines": [], "stdout": "--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering", "stdout_lines": ["--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering"]}

so no change there. Next up I will update to the latest ceph-ansible, since this commit

purge-cluster: zap devices used with the lvm scenario
https://github.com/ceph/ceph-ansible/commit/9747f3dbd5a2eada543a6f61e482e005b6660016

is in ceph-ansible-3.2.0-0.1.rc2.el7cp.noarch.rpm.

I upgraded to ceph-ansible-rc2, ran the purge-cluster and deploy. Unfortunately I had the same result, an early exit from ceph-ansible with this message:

TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] ***
Friday 16 November 2018 15:54:33 +0000 (0:00:01.615)       0:34:51.107 *******
fatal: [c07-h01-6048r]: FAILED! => {"changed": true, "cmd": ["ceph-volume", "--cluster", "ceph", "lvm", "batch", "--bluestore", "--yes", "/dev/sdc", "/dev/sdd", "/dev/nvme0n1", "/dev/sdq", "/dev/sdr", "/dev/nvme1n1", "--report", "--format=json"], "msg": "non-zero return code", "rc": 1, "stderr": "", "stderr_lines": [], "stdout": "--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering", "stdout_lines": ["--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering"]}

I have attached the logs from the purge and deploy, both performed using the rc2 version of ceph-ansible.

Here are the cmds I used to perform this test:
---------------------------------------------
# cd /root; wget http://download.eng.bos.redhat.com/composes/auto/ceph-3.2-rhel-7/RHCEPH-3.2-RHEL-7-20181115.ci.1/compose/Tools/x86_64/os/Packages/ceph-ansible-3.2.0-0.1.rc2.el7cp.noarch.rpm
# rpm -Uvh ceph-ansible-3.2.0-0.1.rc2.el7cp.noarch.rpm
# yum list ceph-ansible
ceph-ansible.noarch    3.2.0-0.1.rc2.el7cp

Purge and redeploy using ceph-ansible.rc2
# ssh c07-h01-6048r ceph -s      ← 48 OSDs
# ansible-playbook purge-cluster.yml 2>&1 | tee -a PurgeRC2.Nov16
# ansible-playbook site.yml 2>&1 | tee -a DeployRC2.Nov16

Created attachment 1506461 [details]
purge runlog using ceph-ansibleRC2
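(For reference, a minimal sketch of what a wipe script like the /root/wipefs_6048.sh used above might look like; the actual script contents beyond the quoted dd line, and the device list, are not shown in this BZ, so treat both as assumptions:

#!/bin/bash
# Hypothetical wipe script: clear signatures and the first MiB of each OSD device.
# The device list below is illustrative only.
for device in sdc sdd sdq sdr nvme0n1 nvme1n1; do
    wipefs -a /dev/$device
    dd if=/dev/zero of=/dev/$device bs=1M count=1
done
)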
Created attachment 1506462 [details]
deploy runlog using ceph-ansibleRC2
I decided to try this same sequence of cmds but this time only specify ONE NVMe device. Sure enough, the deploy worked like a charm.

# yum list ceph-ansible
ceph-ansible.noarch    3.2.0-0.1.rc2.el7cp
# ansible-playbook purge-cluster.yml
# ansible-playbook site.yml 2>&1 | tee -a DeployOneNVMe.Nov16

No errors:

TASK [show ceph status for cluster ceph] ***************************************
Friday 16 November 2018 19:37:58 +0000 (0:00:00.542)       0:55:45.555 *******
ok: [c05-h33-6018r -> c05-h33-6018r] => {
    "msg": [
        " cluster:",
        "   id:     2f9e9148-125e-4783-ab30-6fcd121aca01",
        "   health: HEALTH_OK",
        " ",
        " services:",
        "   mon: 3 daemons, quorum c05-h33-6018r,c06-h29-6018r,c07-h29-6018r",
        "   mgr: c07-h30-6018r(active)",
        "   osd: 144 osds: 144 up, 144 in",
        "   rgw: 12 daemons active",
        " ",
        " data:",
        "   pools:   4 pools, 32 pgs",
        "   objects: 209 objects, 12.1KiB",
        "   usage:   147GiB used, 262TiB / 262TiB avail",
        "   pgs:     32 active+clean",
        " "
    ]
}

INSTALLER STATUS ***************************************************************
Install Ceph Monitor    : Complete (0:05:10)
Install Ceph Manager    : Complete (0:03:38)
Install Ceph OSD        : Complete (0:16:18)
Install Ceph RGW        : Complete (0:05:06)
Install Ceph Client     : Complete (0:18:34)

---------------------------------
The osds.yml file looks like this
---------------------------------
osd_objectstore: bluestore
# use 'ceph-volume lvm batch' mode
osd_scenario: lvm
devices:
  - /dev/sdc
  - /dev/sdd
  - /dev/sde
  - /dev/sdf
  - /dev/sdg
  - /dev/sdh
  - /dev/sdi
  - /dev/sdj
  - /dev/sdk
  - /dev/sdl
  - /dev/sdm
  - /dev/sdn
  - /dev/nvme0n1

Created attachment 1506531 [details]
Success using ceph-ansibleRC2 w/one NVMe
When using two NVMe devices, in the case that fails, you've mentioned that the steps are:
* purge -> ansible-playbook purge-cluster.yml
* wipefs script -> ansible osds -m script -a "/root/wipefs_6048.sh"
* deploy -> ansible-playbook site.yml
The way the ansible implementation works is by checking whether the input devices will change the deployment strategy after filtering. For example:
input: /dev/sda /dev/sdb /dev/nvme0n1
If /dev/nvme0n1 gets filtered out, this changes the strategy from "mixed devices" (spinning and solid state) to "single type" (only one type of device). That would cause an immediate halt of the playbook.
If devices are getting filtered out that don't change the strategy then *there should not be an error*. For example:
input: /dev/sda /dev/sdb /dev/nvme0n1 /dev/nvme1n1
If /dev/nvme0n1 gets filtered out here, the strategy doesn't change, because there is still a mixed-type group of devices.
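(One way to narrow down which devices are being dropped -- a debugging suggestion, not something from the thread -- is to run the report against subsets of the devices and compare the plans each produces:

> ceph-volume lvm batch --bluestore --report /dev/sdc /dev/sdd /dev/sdq /dev/sdr /dev/nvme0n1 /dev/nvme1n1    (full set: a mixed-type plan is expected)
> ceph-volume lvm batch --bluestore --report /dev/sdc /dev/sdd /dev/sdq /dev/sdr                              (HDDs only: a single-type plan is expected)
> ceph-volume lvm batch --bluestore --report /dev/nvme0n1 /dev/nvme1n1                                        (NVMe only: compare whether these still produce a plan)
)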
So something is out of whack here: you say that with 2 NVMe devices things don't work, while with 1 NVMe device they do.
One thing I would try is to call the report after purging+wipefs, to see if devices are getting filtered out:
> ceph-volume lvm batch --report --format=json /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/nvme0n1
If that runs *before* deployment, and *after* purging+wipefs, it should work.
The example call I used was with one NVMe device, but I would try with two as well and toy around with what the reporting says. The JSON reporting is more verbose for us (and useful for this BZ), but you might find it easier to read the normal (pretty) reporting.

How do I run ceph-volume after purging? The cmd will no longer be installed on the OSD nodes.

Am I understanding the syntax (and ordering) in osds.yml correctly? For example, using this:

osd_objectstore: bluestore
# use 'ceph-volume lvm batch' mode
osd_scenario: lvm
devices:
  - /dev/sdc
  - /dev/sdd
  - /dev/nvme0n1
  - /dev/sdq
  - /dev/sdr
  - /dev/nvme1n1

I would expect that the first two HDDs would hold the 'block' portions of OSDs and /dev/nvme0n1 would house their 'db'. The next two HDDs (sdq and sdr) would be another set of 'block' OSDs, and their 'db' would be on /dev/nvme1n1. Should I be specifying devices in a different order? Again, the cfg I am trying to get to is to have 4 HDDs in use as OSDs and to place two of their 'db' on one NVMe and the other two 'db' on the other NVMe. thanks

Another thing that would help the output is to increase the verbosity with -vv
and export:
> ANSIBLE_STDOUT_CALLBACK=debug
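(Putting the two suggestions together, the re-run would look like this; the playbook name comes from earlier comments, the tee log file name is just an example:

# export ANSIBLE_STDOUT_CALLBACK=debug
# ansible-playbook -vv site.yml 2>&1 | tee -a Deploy2nvme.debug
)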
John, the ordering doesn't matter; the batch sub-command will create one large VG for both NVMe devices. You cannot split them like that in one go. If you really want to, you could try a first run with half of the devices and then the rest with the other half. "batch" mode is not meant to be super flexible; it allows you to do without the LV creation, which takes a great deal of code to implement correctly, so the constraints are there to allow a robust execution.

But with LVM it is possible to state that you want a particular LV created on a particular PV within the VG. So it is possible to round-robin the RocksDB and journaling LVs across the available SSD PVs, even if all SSDs are in the same VG. For example, if you have an array of PVs, you can index into the SSD PV list using the LV's array index modulo N, where N is the number of SSD PVs. This is really critical functionality for ceph-ansible. For example, one Ceph site that I know of has 60 HDDs/host; they will definitely have to balance their Bluestore RocksDB and journal space evenly across the SSD devices, and this will not be possible with this release of ceph-ansible if I understand John H correctly. Failure to implement this results in very substandard performance where the SSD becomes the bottleneck instead of the HDD, particularly for sequential writes or large random writes.

@Ben, you are right that LVM does allow all of these configurations. The `lvm batch` sub-command was *not* meant to allow further customization beyond what it already offers. That is why we have the ability to receive pre-made LVs to produce OSDs: so that ceph-volume doesn't need to accommodate every configuration possible with LVM. In the end it is a decision between an easy deployment and a highly configurable one; we can't do both in `lvm batch`. Having said that, if we really want to push forward with more configurable scenarios and LVM, then that should probably go into ceph-ansible; that way the LVs could just be consumed in whatever way they were produced.
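(For illustration, a minimal sketch of the round-robin placement Ben describes, done by hand with LVM and then consumed as pre-made LVs. The VG/LV names, sizes, and loop count are assumptions for illustration; this is not something `lvm batch` or this ceph-ansible release does for you:

# One VG spanning both NVMe PVs, with each db LV pinned to a specific PV
pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate ceph-block-dbs /dev/nvme0n1 /dev/nvme1n1

ssd_pvs=(/dev/nvme0n1 /dev/nvme1n1)
for i in 0 1 2 3; do
    # index modulo N picks the PV, so db LVs alternate between the two NVMe devices
    lvcreate -L 300G -n osd-db-$i ceph-block-dbs ${ssd_pvs[$((i % 2))]}
done

The resulting LVs could then be handed to ceph-volume/ceph-ansible as pre-made LVs instead of using the plain devices list.)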
The purge-cluster.yml output indicates that it *did* run the osd lvm zap commands on the OSD nodes. The complete output from the purge run is in attachment "purge runlog using ceph-ansible RC2". Here is an excerpt:

TASK [zap and destroy osds created by ceph-volume with devices] ****************
Friday 16 November 2018 15:10:22 +0000 (0:00:01.692)       0:01:23.648 *******
changed: [c07-h09-6048r] => (item=/dev/sdc)
changed: [c07-h05-6048r] => (item=/dev/sdc)
changed: [c07-h01-6048r] => (item=/dev/sdc)
changed: [c07-h17-6048r] => (item=/dev/sdc)
changed: [c07-h13-6048r] => (item=/dev/sdc)
changed: [c07-h09-6048r] => (item=/dev/sdd)
changed: [c07-h25-6048r] => (item=/dev/sdc)
changed: [c07-h05-6048r] => (item=/dev/sdd)
changed: [c07-h21-6048r] => (item=/dev/sdc)
changed: [c07-h01-6048r] => (item=/dev/sdd)
changed: [c07-h17-6048r] => (item=/dev/sdd)
changed: [c06-h01-6048r] => (item=/dev/sdc)
changed: [c07-h13-6048r] => (item=/dev/sdd)
changed: [c05-h29-6048r] => (item=/dev/sdc)
changed: [c06-h05-6048r] => (item=/dev/sdc)
changed: [c07-h25-6048r] => (item=/dev/sdd)
changed: [c07-h21-6048r] => (item=/dev/sdd)
changed: [c07-h09-6048r] => (item=/dev/nvme0n1)
changed: [c06-h09-6048r] => (item=/dev/sdc)
changed: [c07-h05-6048r] => (item=/dev/nvme0n1)
changed: [c07-h01-6048r] => (item=/dev/nvme0n1)
changed: [c06-h01-6048r] => (item=/dev/sdd)
changed: [c06-h13-6048r] => (item=/dev/sdc)
changed: [c05-h29-6048r] => (item=/dev/sdd)
changed: [c07-h17-6048r] => (item=/dev/nvme0n1)
changed: [c06-h05-6048r] => (item=/dev/sdd)
changed: [c07-h13-6048r] => (item=/dev/nvme0n1)
changed: [c07-h09-6048r] => (item=/dev/sdq)
changed: [c07-h05-6048r] => (item=/dev/sdq)
changed: [c06-h09-6048r] => (item=/dev/sdd)
changed: [c07-h21-6048r] => (item=/dev/nvme0n1)
changed: [c07-h01-6048r] => (item=/dev/sdq)
changed: [c06-h13-6048r] => (item=/dev/sdd)
changed: [c07-h17-6048r] => (item=/dev/sdq)
changed: [c07-h13-6048r] => (item=/dev/sdq)
changed: [c07-h25-6048r] => (item=/dev/nvme0n1)
changed: [c07-h09-6048r] => (item=/dev/sdr)
changed: [c07-h05-6048r] => (item=/dev/sdr)
changed: [c06-h01-6048r] => (item=/dev/nvme0n1)
changed: [c05-h29-6048r] => (item=/dev/nvme0n1)
changed: [c07-h21-6048r] => (item=/dev/sdq)
changed: [c07-h01-6048r] => (item=/dev/sdr)
changed: [c06-h05-6048r] => (item=/dev/nvme0n1)
changed: [c07-h17-6048r] => (item=/dev/sdr)
changed: [c07-h09-6048r] => (item=/dev/nvme1n1)
changed: [c07-h13-6048r] => (item=/dev/sdr)
changed: [c06-h09-6048r] => (item=/dev/nvme0n1)
changed: [c07-h05-6048r] => (item=/dev/nvme1n1)
changed: [c05-h29-6048r] => (item=/dev/sdq)
changed: [c06-h01-6048r] => (item=/dev/sdq)
changed: [c07-h01-6048r] => (item=/dev/nvme1n1)
changed: [c07-h21-6048r] => (item=/dev/sdr)
changed: [c07-h25-6048r] => (item=/dev/sdq)
changed: [c06-h05-6048r] => (item=/dev/sdq)
changed: [c06-h13-6048r] => (item=/dev/nvme0n1)
changed: [c07-h17-6048r] => (item=/dev/nvme1n1)
changed: [c07-h13-6048r] => (item=/dev/nvme1n1)
changed: [c06-h09-6048r] => (item=/dev/sdq)
changed: [c07-h21-6048r] => (item=/dev/nvme1n1)
changed: [c05-h29-6048r] => (item=/dev/sdr)
changed: [c06-h01-6048r] => (item=/dev/sdr)
changed: [c07-h25-6048r] => (item=/dev/sdr)
changed: [c06-h05-6048r] => (item=/dev/sdr)
changed: [c06-h13-6048r] => (item=/dev/sdq)
changed: [c06-h09-6048r] => (item=/dev/sdr)
changed: [c05-h29-6048r] => (item=/dev/nvme1n1)
changed: [c06-h01-6048r] => (item=/dev/nvme1n1)
changed: [c07-h25-6048r] => (item=/dev/nvme1n1)
changed: [c06-h05-6048r] => (item=/dev/nvme1n1)
changed: [c06-h13-6048r] => (item=/dev/sdr)
changed: [c06-h09-6048r] => (item=/dev/nvme1n1)
changed: [c06-h13-6048r] => (item=/dev/nvme1n1)

I remain concerned that purge is not fully cleaning up to allow a clean redeploy when using ceph-volume lvm batch mode. - John

Could you ensure you are running with:
> ANSIBLE_STDOUT_CALLBACK=debug
And with the -vv flag in ansible?
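(Given the concern above about purge not fully cleaning up, a quick way to verify on an OSD node whether the devices were actually left clean -- a debugging suggestion, not something the playbooks run for you:

# lsblk                        <-- no ceph-block-* LVs should remain on sdc/sdd/sdq/sdr or the NVMe devices
# pvs; vgs; lvs                <-- no ceph-* volume groups or logical volumes should be listed
# wipefs /dev/nvme0n1          <-- lists any leftover LVM/FS signatures; without -a nothing is erased
)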
John, this PR should handle the issue we found with subsequent deploys not being idempotent because of the strategy change:

https://github.com/ceph/ceph-ansible/pull/3348

maybe blocker -- investigating now

We can't reproduce this by deploying on a new system with several NVMe devices. There is a chance the usage of purge+redeploy might hit this, and there is a fix already for the idempotency issue that was found. We don't think this is a blocker.

Did some additional testing today using RC4. I am still not able to deploy using ceph-volume lvm batch. Tested a deploy/purge cycle first with ceph-disk (successful) and then with ceph-volume (failed). I will attach the full ceph-ansible logfile "DeployLVMbatchRC4.Nov28".

RHCS 3.2 CEPH-DISK: ceph-ansible.noarch 3.2.0-0.1.rc2.el7cp
==================
1) Purged existing RHCS 3.2 cluster (ceph-disk non-collocated, bluestore)
2) Deployed RHCS 3.2 cluster (ceph-disk non-collocated, bluestore)
   SUCCESS - no failed tasks and running cluster with expected number of OSDs and RGWs
3) Purged existing RHCS 3.2 cluster (ceph-disk non-collocated, bluestore)

===================================> CEPH-VOLUME <======================
4) Installed latest ceph-ansible (RC4)
   # yum update ceph-ansible
   ceph-ansible.noarch    3.2.0-0.1.rc4.el7cp
5) Deployed RHCS 3.2 cluster (ceph-volume lvm batch, bluestore)

# cat osds.yml
#--------------------------------------------------------------------
osd_objectstore: bluestore
# use 'ceph-volume lvm batch' mode
osd_scenario: lvm
devices:
  - /dev/sdc
  - /dev/sdd
  - /dev/nvme0n1
  - /dev/sdq
  - /dev/sdr
  - /dev/nvme1n1

# export ANSIBLE_STDOUT_CALLBACK=debug
# ansible-playbook -vv site.yml 2>&1 | tee -a DeployLVMbatchRC4.Nov28

<...SNIP...>
TASK [ceph-config : generate ceph configuration file: ceph.conf] ***************
task path: /usr/share/ceph-ansible/roles/ceph-config/tasks/main.yml:77
Wednesday 28 November 2018 18:20:33 +0000 (0:00:01.020)       0:15:19.542 ****
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: [line 16]: u' # non_hci_safety_factor is the safety factor for dedicated nodes\n'
fatal: [c07-h01-6048r]: FAILED! => {}

MSG:

Unexpected failure during module execution.
PLAY RECAP *********************************************************************
c03-h15-r620               : ok=22   changed=1    unreachable=0    failed=0
c03-h17-r620               : ok=22   changed=1    unreachable=0    failed=0
c03-h19-r620               : ok=22   changed=1    unreachable=0    failed=0
c03-h21-r620               : ok=22   changed=1    unreachable=0    failed=0
c04-h33-6018r              : ok=22   changed=1    unreachable=0    failed=0
c05-h33-6018r              : ok=93   changed=12   unreachable=0    failed=0
c06-h29-6018r              : ok=83   changed=10   unreachable=0    failed=0
c07-h01-6048r              : ok=60   changed=6    unreachable=0    failed=1
c07-h05-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h09-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h13-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h17-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h21-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h25-6048r              : ok=57   changed=6    unreachable=0    failed=1
c07-h29-6018r              : ok=85   changed=13   unreachable=0    failed=0
c07-h30-6018r              : ok=83   changed=11   unreachable=0    failed=0

INSTALLER STATUS ***************************************************************
Install Ceph Monitor    : Complete (0:04:40)
Install Ceph Manager    : Complete (0:03:29)
Install Ceph OSD        : In Progress (0:03:59)
    This phase can be restarted by running: roles/ceph-osd/tasks/main.yml

Wednesday 28 November 2018 18:20:43 +0000 (0:00:09.903)       0:15:29.446 ****
===============================================================================
<...SNIP...>
ansible -vv using ceph-ansibleRC4 - FAILED
> An exception occurred during task execution. To see the full traceback, use
> -vvv. The error was: [line 16]: u' # non_hci_safety_factor is the safety factor for dedicated nodes\n'
This looks like it is unrelated to ceph-volume? The ceph configuration is failing to be generated.
What is "non_hci_safety_factor" ? Maybe Sebastien can help here
I think they were trying to conditionalize how much memory was reserved for Bluestore OSDs (i.e. the OSD caching layer), depending on whether or not the Ceph OSD host had to also run other things (example: hyperconverged OpenStack). Since you are running dedicated OSD hosts, you need the non_hci_safety_factor parameter.

@John, I suspect that Sebastien might want to see the traceback that is hidden from the log output at the verbose levels used. Can you re-run the ansible-playbook command with the -vvv flag and paste the full task failure? Hopefully that will have the traceback, and enough information to see what is going on there.

Reran using the -vvv flag and here is the full task failure, repeated for each of the OSD nodes...

MSG:

Unexpected failure during module execution.

The full traceback is:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line 139, in run
    res = self._execute()
  File "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line 584, in _execute
    result = self._handler.run(task_vars=variables)
  File "/usr/share/ceph-ansible/plugins/actions/config_template.py", line 641, in run
    default_section=_vars.get('default_section', 'DEFAULT')
  File "/usr/share/ceph-ansible/plugins/actions/config_template.py", line 330, in return_config_overrides_ini
    config.readfp(config_object)
  File "/usr/lib64/python2.7/ConfigParser.py", line 324, in readfp
    self._read(fp, filename)
  File "/usr/share/ceph-ansible/plugins/actions/config_template.py", line 289, in _read
    raise e
ParsingError: File contains parsing errors: <???>
    [line 16]: u' # non_hci_safety_factor is the safety factor for dedicated nodes\n'

fatal: [c07-h25-6048r]: FAILED! => {}

MSG:

Unexpected failure during module execution.

Created attachment 1509646 [details]
ansible runlog using -vvv
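(The ParsingError above is the config_template plugin's INI parser rejecting a comment line that begins with whitespace. Assuming the offending line lives in the group_vars content fed into the generated ceph.conf -- an assumption, since the traceback does not identify the source file -- indented comment lines can be hunted down with something like:

# grep -rn '^[[:space:]]\+#' /usr/share/ceph-ansible/group_vars/
)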
John, which version of Ansible are you using? I just tried enabling this, and I'm not able to reproduce your issue. I suspect it's a matter of Ansible version; I'm running 2.7.2.

# ansible --version
ansible 2.6.7
  config file = /usr/share/ceph-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

What ansible version is required for RHCS 3.2?

> What ansible version is required for RHCS 3.2?

Ansible 2.6 (documented in bug 1613941)

The same failure has been reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1654441, so let's move that conversation to that BZ. I'm moving this one to POST again, as the original issue has been solved.

Verified with ceph-ansible 3.2.0-0.1.rc8.el7cp. Ceph-ansible deploys a cluster with two NVMe devices successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0020
Created attachment 1506222 [details]
ceph-ansible runlog

Description of problem:
ceph-ansible deployment fails when specifying multiple NVMe devices

Version-Release number of selected component (if applicable):
RHEL 7.6
Ceph Version: 12.2.8-34.el7cp
Ceph Ansible Version: ceph-ansible-3.2.0-0.1.rc1.el7cp.noarch
ceph-volume Version: 1.0.0

Steps to Reproduce:
1. Running on Supermicro 6048r systems with 36x HDDs and two NVMe devices
2. Specified this in osds.yml (limited cfg to 4x HDDs and 2x NVMe):
   osd_objectstore: bluestore
   # use 'ceph-volume lvm batch' mode
   osd_scenario: lvm
   devices:
     - /dev/sdc
     - /dev/sdd
     - /dev/nvme0n1
     - /dev/sdq
     - /dev/sdr
     - /dev/nvme1n1
3. # ansible-playbook site.yml 2>&1 | tee -a Deploy2nvme.Nov15

Actual results:
ceph-ansible fails with... (see the complete output in the attached logfile)

TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] ***
Thursday 15 November 2018 19:11:38 +0000 (0:00:01.628)       0:34:45.946 *****
fatal: [c07-h01-6048r]: FAILED! => {"changed": true, "cmd": ["ceph-volume", "--cluster", "ceph", "lvm", "batch", "--bluestore", "--yes", "/dev/sdc", "/dev/sdd", "/dev/nvme0n1", "/dev/sdq", "/dev/sdr", "/dev/nvme1n1", "--report", "--format=json"], "msg": "non-zero return code", "rc": 1, "stderr": "", "stderr_lines": [], "stdout": "--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering", "stdout_lines": ["--> Aborting because strategy changed from bluestore.MixedType to bluestore.SingleType after filtering"]}

Expected results:
ceph-ansible deploys a cluster with two HDDs paired to each of the two NVMe devices

Additional info:  <-- from one of the OSD systems after the ceph-ansible run

# ssh c07-h01-6048r ceph-volume lvm list

====== osd.0 =======

  [block]    /dev/ceph-block-1df28d7e-c3d9-47e6-9d30-71ff1ec22128/osd-block-6d15f9de-ef13-4eb6-8a4e-d39366072bd9

      type                  block
      osd id                0
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              96d9a471-de1b-44ed-9cf0-7dc1c688edf9
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-665bfc0b-3c5d-4167-a7a1-1915dcdb625b
      encrypted             0
      db uuid               6exXcU-vBo1-AeHs-JBGY-a5Uz-v3Q7-2sQc2Y
      cephx lockbox secret
      block uuid            UyafDc-Y7UQ-WCCL-sD7O-8S5B-CI9w-7NjLVz
      block device          /dev/ceph-block-1df28d7e-c3d9-47e6-9d30-71ff1ec22128/osd-block-6d15f9de-ef13-4eb6-8a4e-d39366072bd9
      vdo                   0
      crush device class    None
      devices               /dev/sdc

  [ db]      /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-665bfc0b-3c5d-4167-a7a1-1915dcdb625b

      type                  db
      osd id                0
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              96d9a471-de1b-44ed-9cf0-7dc1c688edf9
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-665bfc0b-3c5d-4167-a7a1-1915dcdb625b
      encrypted             0
      db uuid               6exXcU-vBo1-AeHs-JBGY-a5Uz-v3Q7-2sQc2Y
      cephx lockbox secret
      block uuid            UyafDc-Y7UQ-WCCL-sD7O-8S5B-CI9w-7NjLVz
      block device          /dev/ceph-block-1df28d7e-c3d9-47e6-9d30-71ff1ec22128/osd-block-6d15f9de-ef13-4eb6-8a4e-d39366072bd9
      vdo                   0
      crush device class    None
      devices               /dev/nvme0n1

====== osd.26 ======

  [block]    /dev/ceph-block-1108ff83-a82a-466a-94f5-7b51eb6061e7/osd-block-0fa15758-5870-4df3-8d24-237673c995e6

      type                  block
      osd id                26
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              7840a23f-9829-4c7e-a401-c38da530ab8b
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-f4cdcf01-a6c5-4620-b096-7f2d8d1afd12
      encrypted             0
      db uuid               5MyXhu-V3sj-B3CR-1Pes-VWxM-jGbu-w8APQD
      cephx lockbox secret
      block uuid            hww6LU-KT3L-eSiK-iSAC-xxDt-Y5Js-6s1v3c
      block device          /dev/ceph-block-1108ff83-a82a-466a-94f5-7b51eb6061e7/osd-block-0fa15758-5870-4df3-8d24-237673c995e6
      vdo                   0
      crush device class    None
      devices               /dev/sdq

  [ db]      /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-f4cdcf01-a6c5-4620-b096-7f2d8d1afd12

      type                  db
      osd id                26
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              7840a23f-9829-4c7e-a401-c38da530ab8b
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-f4cdcf01-a6c5-4620-b096-7f2d8d1afd12
      encrypted             0
      db uuid               5MyXhu-V3sj-B3CR-1Pes-VWxM-jGbu-w8APQD
      cephx lockbox secret
      block uuid            hww6LU-KT3L-eSiK-iSAC-xxDt-Y5Js-6s1v3c
      block device          /dev/ceph-block-1108ff83-a82a-466a-94f5-7b51eb6061e7/osd-block-0fa15758-5870-4df3-8d24-237673c995e6
      vdo                   0
      crush device class    None
      devices               /dev/nvme1n1

====== osd.12 ======

  [block]    /dev/ceph-block-69ac31a2-65e2-40f2-84d9-0f00720e03c9/osd-block-618522db-46b0-4b24-aec5-cc5cee180210

      type                  block
      osd id                12
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              0cfdf0cd-854f-4a81-b433-b7b7b5b164dd
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-d2f397c7-3d0a-4a19-bac9-bb23164a6b5c
      encrypted             0
      db uuid               1tejrb-6XpV-0XIO-ybMW-wTEs-ejmL-b23R2d
      cephx lockbox secret
      block uuid            UPIHR7-125E-L9lG-1501-GY5s-eZnU-cZRk6N
      block device          /dev/ceph-block-69ac31a2-65e2-40f2-84d9-0f00720e03c9/osd-block-618522db-46b0-4b24-aec5-cc5cee180210
      vdo                   0
      crush device class    None
      devices               /dev/sdd

  [ db]      /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-d2f397c7-3d0a-4a19-bac9-bb23164a6b5c

      type                  db
      osd id                12
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              0cfdf0cd-854f-4a81-b433-b7b7b5b164dd
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-d2f397c7-3d0a-4a19-bac9-bb23164a6b5c
      encrypted             0
      db uuid               1tejrb-6XpV-0XIO-ybMW-wTEs-ejmL-b23R2d
      cephx lockbox secret
      block uuid            UPIHR7-125E-L9lG-1501-GY5s-eZnU-cZRk6N
      block device          /dev/ceph-block-69ac31a2-65e2-40f2-84d9-0f00720e03c9/osd-block-618522db-46b0-4b24-aec5-cc5cee180210
      vdo                   0
      crush device class    None
      devices               /dev/nvme0n1

====== osd.37 ======

  [block]    /dev/ceph-block-015114af-dc99-472f-8a11-5abe40fa780e/osd-block-c70b2ad9-3101-491a-a2e0-a4ac45c4bad0

      type                  block
      osd id                37
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              87c7553e-d27c-4e77-a1f2-bb284bbc18ce
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-ef348a39-2d05-4b96-9b00-9f32b0053d20
      encrypted             0
      db uuid               RJEKem-BtyO-hz75-nu22-pCFI-XESq-KkoFwI
      cephx lockbox secret
      block uuid            IdeGHL-ZhPl-uoqu-m3oB-A9l8-IO2W-2ovKok
      block device          /dev/ceph-block-015114af-dc99-472f-8a11-5abe40fa780e/osd-block-c70b2ad9-3101-491a-a2e0-a4ac45c4bad0
      vdo                   0
      crush device class    None
      devices               /dev/sdr

  [ db]      /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-ef348a39-2d05-4b96-9b00-9f32b0053d20

      type                  db
      osd id                37
      cluster fsid          a0b25557-9b93-48bc-b23d-7b6ae75c46eb
      cluster name          ceph
      osd fsid              87c7553e-d27c-4e77-a1f2-bb284bbc18ce
      db device             /dev/ceph-block-dbs-f9277f5e-9b73-4e41-805e-b9c07d09e594/osd-block-db-ef348a39-2d05-4b96-9b00-9f32b0053d20
      encrypted             0
      db uuid               RJEKem-BtyO-hz75-nu22-pCFI-XESq-KkoFwI
      cephx lockbox secret
      block uuid            IdeGHL-ZhPl-uoqu-m3oB-A9l8-IO2W-2ovKok
      block device          /dev/ceph-block-015114af-dc99-472f-8a11-5abe40fa780e/osd-block-c70b2ad9-3101-491a-a2e0-a4ac45c4bad0
      vdo                   0
      crush device class    None
      devices               /dev/nvme1n1

# ssh c07-h01-6048r lsblk
NAME                                                                                                       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdc                                                                                                          8:32   0   1.8T  0 disk
└─ceph--block--1df28d7e--c3d9--47e6--9d30--71ff1ec22128-osd--block--6d15f9de--ef13--4eb6--8a4e--d39366072bd9  253:0  0   1.8T  0 lvm
sdd                                                                                                          8:48   0   1.8T  0 disk
└─ceph--block--69ac31a2--65e2--40f2--84d9--0f00720e03c9-osd--block--618522db--46b0--4b24--aec5--cc5cee180210  253:2  0   1.8T  0 lvm
sdq                                                                                                         65:0    0   1.8T  0 disk
└─ceph--block--1108ff83--a82a--466a--94f5--7b51eb6061e7-osd--block--0fa15758--5870--4df3--8d24--237673c995e6  253:4  0   1.8T  0 lvm
sdr                                                                                                         65:16   0   1.8T  0 disk
└─ceph--block--015114af--dc99--472f--8a11--5abe40fa780e-osd--block--c70b2ad9--3101--491a--a2e0--a4ac45c4bad0  253:6  0   1.8T  0 lvm
nvme0n1                                                                                                    259:1    0 745.2G  0 disk
├─ceph--block--dbs--f9277f5e--9b73--4e41--805e--b9c07d09e594-osd--block--db--665bfc0b--3c5d--4167--a7a1--1915dcdb625b  253:1  0   372G  0 lvm
└─ceph--block--dbs--f9277f5e--9b73--4e41--805e--b9c07d09e594-osd--block--db--d2f397c7--3d0a--4a19--bac9--bb23164a6b5c  253:3  0   372G  0 lvm
nvme1n1                                                                                                    259:0    0 745.2G  0 disk
├─ceph--block--dbs--f9277f5e--9b73--4e41--805e--b9c07d09e594-osd--block--db--f4cdcf01--a6c5--4620--b096--7f2d8d1afd12  253:5  0   372G  0 lvm
└─ceph--block--dbs--f9277f5e--9b73--4e41--805e--b9c07d09e594-osd--block--db--ef348a39--2d05--4b96--9b00--9f32b0053d20  253:7  0   372G  0 lvm