Description of problem:

In a Ceph HCI cluster with identical block devices (type: SSD, model: SSDSC2KG480G8R), volume groups dedicated to block.db were unexpectedly created on one of the HCI compute nodes. As far as I know (and as I managed to deploy successfully on previous composes with the same hardware and topology), when there is no mix of slower and faster disks (i.e. HDDs and SSDs), no dedicated device should be used to store block.db. That is not what happens in my setup. I am not sure whether this is a misconfiguration, a regression (although I doubt it, since we have managed to deploy a Ceph HCI cluster in our CI using this compose on a 'simpler' topology), or an edge case I've hit.

Ceph configuration:

parameter_defaults:
  CephPoolDefaultSize: 2
  CephPoolDefaultPgNum: 64
  CephPools:
    - {"name": backups, "pg_num": 32, "pgp_num": 32, "application": "rbd"}
    - {"name": volumes, "pg_num": 512, "pgp_num": 512, "application": "rbd"}
    - {"name": vms, "pg_num": 128, "pgp_num": 128, "application": "rbd"}
    - {"name": images, "pg_num": 64, "pgp_num": 64, "application": "rbd"}
  CephConfigOverrides:
    osd_recovery_op_priority: 3
    osd_recovery_max_active: 3
    osd_max_backfills: 1
  CephAnsibleExtraConfig:
    nb_retry_wait_osd_up: 60
    delay_wait_osd_up: 20
    is_hci: true
    # 6 OSDs * 2 vCPUs per non-NVMe SSD = 12 vCPUs (list below not used for VNF)
    # vCPUs from NUMA node 1 will be assigned to Ceph OSD
    ceph_osd_docker_cpuset_cpus: "5,7,9,11,13,15,17,19,23,25,27,29"
    # cpu_limit 0 means no limit, as we are limiting CPUs with cpuset above
    ceph_osd_docker_cpu_limit: 0
    # numactl preferred to cross the NUMA boundary if we have to,
    # but try to only use memory from NUMA node 0;
    # cpuset-mems would not let it cross the NUMA boundary,
    # and there is lots of memory, so NUMA boundary crossing is unlikely
    ceph_osd_numactl_opts: "-N 1 --preferred=1"
  CephAnsibleDisksConfig:
    # 2 OSDs per SSD
    osds_per_device: 2
    osd_scenario: lvm
    osd_objectstore: bluestore
    devices:
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd

lsblk output from the properly
provisioned node (computehciovndpdksriov-0):

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 894.3G 0 disk
|-sda1 8:1 0 1M 0 part
`-sda2 8:2 0 894.3G 0 part /
sdb 8:16 0 447.1G 0 disk
|-ceph--ba2e0109--896a--4e31--ae8e--3625d879eafc-osd--data--e2839cb7--2529--4917--b8cc--eb2a07e1a8e2 253:0 0 223.6G 0 lvm
`-ceph--ba2e0109--896a--4e31--ae8e--3625d879eafc-osd--data--bd1f17fa--ad8e--4a25--8052--0251d7145b72 253:1 0 223.6G 0 lvm
sdc 8:32 0 447.1G 0 disk
|-ceph--a0d35eac--0b75--4ed8--b03a--45ad6a4aa694-osd--data--2e39b38b--2bf0--4a8b--b616--0d4c6001cbe1 253:2 0 223.6G 0 lvm
`-ceph--a0d35eac--0b75--4ed8--b03a--45ad6a4aa694-osd--data--ec0168b8--50ad--423b--a85b--fba38b8eb4a6 253:3 0 223.6G 0 lvm
sdd 8:48 0 447.1G 0 disk
|-ceph--5d8407ba--ffdd--4880--b351--c5e3b2b94ea3-osd--data--063ee4d3--1db2--43b1--be82--4c2d3c447b1b 253:4 0 223.6G 0 lvm
`-ceph--5d8407ba--ffdd--4880--b351--c5e3b2b94ea3-osd--data--00daf5d4--67d4--44d9--a322--656b8849563d 253:5 0 223.6G 0 lvm
sr0 11:0 1 1024M 0 rom

lsblk output from the improperly provisioned node (computehciovndpdksriov-1):

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 894.3G 0 disk
|-sda1 8:1 0 1M 0 part
`-sda2 8:2 0 894.3G 0 part /
sdb 8:16 0 447.1G 0 disk
`-ceph--block--dbs--66661c38--56e6--4381--9d7e--373ea48d4e17-osd--block--db--af9811a4--67dd--425c--9930--5795ca35b41f 253:1 0 447.1G 0 lvm
sdc 8:32 0 447.1G 0 disk
|-ceph--block--053a8597--99f2--4518--816a--fb1a20cba79a-osd--block--eab71c9d--ce74--48e9--96ed--27c625131d76 253:0 0 223.6G 0 lvm
`-ceph--block--053a8597--99f2--4518--816a--fb1a20cba79a-osd--block--241657ab--9736--4e5d--b3e7--2b1f88203627 253:2 0 223.6G 0 lvm
sdd 8:48 0 447.1G 0 disk
`-ceph--block--dbs--66661c38--56e6--4381--9d7e--373ea48d4e17-osd--block--db--d7e156a9--41b6--4af9--b5d8--da09ee53aa61 253:3 0 447.1G 0 lvm
sr0 11:0 1 1024M 0 rom

vgs output from the properly provisioned node (long VG names wrap onto the following line):

VG Attr Ext #PV #LV #SN VSize VFree VG UUID VProfile #VMda VMdaFree VMdaSize #VMdaUse VG Tags
ceph-5d8407ba-ffdd-4880-b351-c5e3b2b94ea3
  wz--n- 4.00m 1 2 0 <447.13g 4.00m LAPsiR-Ez5K-3orI-zNzP-TdSt-8b37-1z1hVX 1 506.50k 1020.00k 1
ceph-a0d35eac-0b75-4ed8-b03a-45ad6a4aa694
  wz--n- 4.00m 1 2 0 <447.13g 4.00m hFGJX1-ssAN-CO8d-j8aq-3mFT-IoGX-TXI37E 1 506.50k 1020.00k 1
ceph-ba2e0109-896a-4e31-ae8e-3625d879eafc
  wz--n- 4.00m 1 2 0 <447.13g 4.00m lNcPeu-FxHt-ffP0-xs7h-fz0o-tBP3-M4Ua30 1 506.50k 1020.00k 1
Reloading config files

vgs output from the improperly provisioned node:

VG Attr Ext #PV #LV #SN VSize VFree VG UUID VProfile #VMda VMdaFree VMdaSize #VMdaUse VG Tags
ceph-block-053a8597-99f2-4518-816a-fb1a20cba79a
  wz--n- 4.00m 1 2 0 <447.13g 4.00m 24YewP-knRa-WdaQ-w9im-v3wF-vzAc-c5iVLb 1 506.00k 1020.00k 1
ceph-block-dbs-66661c38-56e6-4381-9d7e-373ea48d4e17
  wz--n- 4.00m 2 2 0 <894.26g 0 7GinLN-Sgy1-UQgd-CLsO-5JWZ-aj6X-uLh2kQ 2 506.00k 1020.00k 2
Reloading config files

Topology consists of:
3 ControllerSriov nodes
2 ComputeHCIOvsDpdkSriov (a custom role based on ComputeHCIOvsDpdk with the required SR-IOV resources enabled; we have verified this role works on the same setup that was successfully deployed in CI)
1 ComputeOvsDpdkSriov (a non-Ceph-HCI compute node; the deployment also fails if we do not use this node)

As mentioned earlier, we were able to deploy Ceph HCI on the topology above with earlier composes.

Version-Release number of selected component (if applicable):
compose: RHOS-16.1-RHEL-8-20201214.n.3 (also encountered on 16.1.2 - RHOS-16.1-RHEL-8-20201021.n.0 with the same release of ceph-ansible)
ceph-ansible: ceph-ansible-4.0.31-1.el8cp.noarch (also encountered with ceph-ansible-4.0.25.2-1.el8cp.noarch)

uname -a output on the affected host:
Linux computehciovndpdksriov-1 4.18.0-193.29.1.el8_2.x86_64 #1 SMP Thu Oct 22 10:09:53 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
100% of the several attempts I've made.

Steps to Reproduce:
1. Attempt a Ceph HCI deployment.

Actual results:
The deployment fails.

Expected results:
The deployment succeeds.
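The expectation stated in the description can be made concrete with a minimal Python sketch. This is my own illustration of the batch-placement behavior as I understand it, not ceph-volume's actual code: a dedicated block.db device should only be carved out when fast and slow media are mixed, so an all-SSD node like this one should get data LVs only.

```python
def plan_batch(devices):
    """Sketch of the expected OSD placement for a batch of devices.

    devices: list of (name, rotational) pairs, where rotational is the
    0/1 flag the kernel exposes in /sys/block/<dev>/queue/rotational.
    """
    hdds = [name for name, rot in devices if rot]
    ssds = [name for name, rot in devices if not rot]
    if hdds and ssds:
        # Mixed media: data on the slow disks, block.db on the fast ones.
        return {"data": hdds, "block_db": ssds}
    # Homogeneous media: every device holds data, no dedicated block.db.
    return {"data": [name for name, _ in devices], "block_db": []}

# All three devices in this cluster are SSDs, so no block.db VG is expected:
print(plan_batch([("/dev/sdb", 0), ("/dev/sdc", 0), ("/dev/sdd", 0)]))
# -> {'data': ['/dev/sdb', '/dev/sdc', '/dev/sdd'], 'block_db': []}
```

The faulty node behaves as if the mixed-media branch were taken even though all devices are non-rotational.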
Additional info:
Will attach sosreport logs in a comment.
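The VG names in the vgs output above are enough to spot the bad layout across nodes: the dedicated block.db volume groups carry a "ceph-block-dbs-" prefix. A small hypothetical helper (my own sketch, not part of any tool) to flag them:

```python
def unexpected_block_db_vgs(vg_names):
    """Return the VGs holding dedicated block.db LVs.

    On an all-SSD node the expected result is an empty list;
    anything returned indicates the faulty layout.
    """
    return [vg for vg in vg_names if vg.startswith("ceph-block-dbs-")]

# VG names taken from the two nodes' vgs output above:
good = ["ceph-5d8407ba-ffdd-4880-b351-c5e3b2b94ea3",
        "ceph-a0d35eac-0b75-4ed8-b03a-45ad6a4aa694",
        "ceph-ba2e0109-896a-4e31-ae8e-3625d879eafc"]
bad = ["ceph-block-053a8597-99f2-4518-816a-fb1a20cba79a",
       "ceph-block-dbs-66661c38-56e6-4381-9d7e-373ea48d4e17"]
print(unexpected_block_db_vgs(good))  # -> []
print(unexpected_block_db_vgs(bad))   # -> ['ceph-block-dbs-66661c38-56e6-4381-9d7e-373ea48d4e17']
```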
A version of ceph-volume that fixes this problem will be available in the Fixed In Version of bug 1878500.

This bug is present in ceph container 4-36 [1], which is based on Ceph 14.2.8. A new ceph container based on 14.2.11 will be released with Ceph 4.2, as tracked in bz 1878500. If you deploy your overcloud using the new container from bz 1878500, you should not hit this issue.

[1] https://catalog.redhat.com/software/containers/rhceph/rhceph-4-rhel8/5e39df7cd70cc54b02baf33f

*** This bug has been marked as a duplicate of bug 1878500 ***