Description of problem:
I've created a Ceph cluster from 4 OSD nodes; each node has 8 spare disks (/dev/vd{b..i} - vdb and vdc have virtually 100GB, vdd..vdi have virtually 1TB).

Apart from the fact that only a small part of the expected OSDs were properly created (Bug 1333399 comment 12), there are other issues: "ceph journal" and "ceph data" partitions are randomly distributed across the available disks, there is only one "ceph journal" partition per disk, and there is no way to affect the distribution of "ceph journal" and "ceph data" partitions across the disks.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-5.el7scon.noarch
ceph-base-10.2.0-1.el7cp.x86_64
ceph-common-10.2.0-1.el7cp.x86_64
ceph-installer-1.0.6-1.el7scon.noarch
ceph-mon-10.2.0-1.el7cp.x86_64
ceph-osd-10.2.0-1.el7cp.x86_64
ceph-selinux-10.2.0-1.el7cp.x86_64
rhscon-agent-0.0.6-1.el7scon.noarch
rhscon-ceph-0.0.13-1.el7scon.x86_64
rhscon-core-0.0.16-1.el7scon.x86_64
rhscon-ui-0.0.29-1.el7scon.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare a few nodes for a Ceph cluster, each with a larger number of disks (e.g. 8).
2. Create a cluster according to the documentation.
3. Check the created cluster and the disk usage on each node.

Actual results:
a) "ceph journal" and "ceph data" partitions are randomly and absurdly distributed across the available disks.

b) The "ceph journal" for each OSD consumes a whole disk (one disk contains only one "ceph journal" partition even when there is plenty of available space).

c) The administrator is not able to choose which disks should be used for journals, which for OSD data, and which should be left untouched.
For example, in my deployment each node has 2 smaller disks (100GB each) intended for journals and 6 big disks (1TB each) intended for Ceph data.

Expected results:
a) "ceph journal" and "ceph data" partitions should be distributed more logically (the particular implementation depends on point c)).

b) Journal partitions for multiple OSDs should be created on one disk.

c) The administrator should be able to affect this process - select or prioritize disks intended for journals (e.g. SSD disks) and for data, and possibly also skip some disks and leave them untouched.
Additional info:
# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0   20G  0 disk
├─vda1 253:1    0    2G  0 part
└─vda2 253:2    0   18G  0 part /
vdb    253:16   0  100G  0 disk
└─vdb1 253:17   0    4G  0 part
vdc    253:32   0  100G  0 disk
└─vdc1 253:33   0    4G  0 part
vdd    253:48   0    1T  0 disk
└─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/ceph-1
vde    253:64   0    1T  0 disk
vdf    253:80   0    1T  0 disk
└─vdf1 253:81   0    4G  0 part
vdg    253:96   0    1T  0 disk
└─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/ceph-0
vdh    253:112  0    1T  0 disk
└─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/ceph-2
vdi    253:128  0    1T  0 disk
└─vdi1 253:129  0    4G  0 part

# blkid
/dev/block/253:2: UUID="a8886be3-e35c-49fd-878a-c5bfd22aa9fe" TYPE="ext4"
/dev/block/253:1: UUID="4424075e-8873-41c3-90d0-1285b6abdb52" TYPE="swap"
/dev/vdd1: UUID="ef0796ff-d443-4571-970c-d478526eee83" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="e2da038c-864f-4180-b914-55ccb8e601e8"
/dev/vdg1: UUID="16d0590b-dea8-4fd0-8f39-0ed6f81be97a" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="5d2c1ef0-0d3e-4d68-a1e3-c7491d44eeb3"
/dev/vdh1: UUID="109ec020-6b64-4b87-b215-8fbefc416e80" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="5375ed41-4d1e-42ff-a5d7-1dc3a11e9c21"
/dev/vdb1: PARTLABEL="ceph journal" PARTUUID="080e098e-7963-4691-a1e0-29c6e01c1166"
/dev/vdc1: PARTLABEL="ceph journal" PARTUUID="742b83f4-85b5-48bd-a775-6dba9819c1ca"
/dev/vde: PTTYPE="gpt"
/dev/vdf1: PARTLABEL="ceph journal" PARTUUID="3334f85d-df37-4711-841e-89bc68aebe40"
/dev/vdi1: PARTLABEL="ceph journal" PARTUUID="e196bd0b-4492-414d-9be7-196cd8e299ce"
Created attachment 1155276 [details] POST request data
(In reply to Daniel Horák from comment #2)
> Created attachment 1155276 [details]
> POST request data

Create cluster POST request.
(In reply to Daniel Horák from comment #0)
> Actual results:
> a) "ceph journal" and "ceph data" partitions are randomly and absurdly
> distributed across the available disks.

The selection of journals is not random. The algorithm is as follows:

1. If all the disks are rotational, a disk can serve as the journal for only one other disk. The disks are sorted in descending order of size and the bigger disks use the smaller ones as journal disks, but one data disk uses only one other (smaller) disk as its journal.

2. If some disks are SSDs and some are rotational, the SSDs are given preference as journal disks. One SSD can serve as the journal for a maximum of 6 disks (as long as space is available). In this case we start mapping data disks to SSDs to be used as journals and continue until either the space is exhausted or the number of mapped disks reaches 6. Once an SSD has reached this limit, we start using the next SSD for journal mapping.

3. If disks are still left over after the above logic, we do the mapping among the remaining disks. If only rotational disks are left, the mapping happens as per the logic in step 1.

4. If only SSDs are left, the mapping happens as per the logic in step 2. One SSD can be used as the journal for a maximum of 6 disks.

> b) The "ceph journal" for each OSD consumes a whole disk (one disk contains
> only one "ceph journal" partition even when there is plenty of available
> space).

In this case it looks like all your disks are rotational, so one whole disk is used as the journal for another disk. A rotational disk cannot be used as the journal for more than one disk.

> c) The administrator is not able to choose which disks should be used for
> journals, which for OSD data, and which should be left untouched.
> For example, in my deployment each node has 2 smaller disks (100GB each)
> intended for journals and 6 big disks (1TB each) intended for Ceph data.

This was discussed with UX during the design phase and it was decided that journal mapping should be done automatically and intelligently, without the user having to intervene. Currently there is no way to select journal disks on a per-disk basis.
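To make the described rules concrete, here is a minimal Python sketch of the journal mapping logic as explained in the previous comment. This is only an illustration of those rules, not the actual rhscon-ceph/ceph-installer code; the 5 GB journal size, the tuple format and the helper name are assumptions made for the example.

# Illustrative sketch of the journal mapping rules described above.
# Not the real implementation. A disk is a (name, size_gb, is_ssd) tuple;
# the function returns {data_disk: journal_disk}.

MAX_JOURNALS_PER_SSD = 6   # limit mentioned in the comment above
JOURNAL_SIZE_GB = 5        # assumed journal partition size

def map_journals(disks):
    ssds = sorted((d for d in disks if d[2]), key=lambda d: d[1], reverse=True)
    rotational = sorted((d for d in disks if not d[2]), key=lambda d: d[1], reverse=True)
    mapping = {}
    remaining = list(rotational)

    # Rule 2: rotational data disks prefer SSDs as journals, up to 6
    # journals per SSD and only while space remains on the SSD.
    for ssd in ssds:
        used, count = 0, 0
        while (remaining and count < MAX_JOURNALS_PER_SSD
               and used + JOURNAL_SIZE_GB <= ssd[1]):
            data = remaining.pop(0)        # biggest remaining disk becomes data
            mapping[data[0]] = ssd[0]
            used += JOURNAL_SIZE_GB
            count += 1

    # Rules 1 and 3: leftover rotational disks pair among themselves,
    # the bigger disk holds data, the smaller one is its dedicated journal.
    # (Rule 4 - leftover SSDs pairing among themselves - is omitted here.)
    while len(remaining) >= 2:
        data = remaining.pop(0)            # biggest leftover
        journal = remaining.pop()          # smallest leftover
        mapping[data[0]] = journal[0]

    return mapping

For example, with one 100G disk flagged as SSD (vdb), one 100G rotational disk (vdc) and seven 1T rotational disks (vdd-vdj), this sketch maps six of the 1T disks to journal partitions on vdb and the seventh 1T disk to vdc as a whole-disk journal.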
Shubhendu, thanks for the clarification! I was slightly confused because the assignment seemed different on each node, but that was probably because of other issues (Bug 1333399). I will try to simulate it on nodes with SSDs and check the behaviour.

Also, I think this process should be described in the documentation.
Agreed. This should be added to the documentation.
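Regarding simulating SSDs: which disks get treated as SSDs presumably comes down to the rotational flag the kernel reports (the ROTA column of "lsblk -d -o NAME,ROTA,SIZE"). Assuming the mapping logic relies on that flag, a quick check on a node could look like the snippet below (device names are the ones from this report; this is just a convenience check, not part of any product code).

# Print whether the kernel reports each data disk as SSD or rotational.
# 0 = non-rotational (treated as SSD), 1 = rotational.
for dev in ["vdb", "vdc", "vdd", "vde", "vdf", "vdg", "vdh", "vdi"]:
    with open("/sys/block/%s/queue/rotational" % dev) as f:
        flag = f.read().strip()
    print("%s: %s" % (dev, "SSD" if flag == "0" else "rotational"))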
I have another question related to this: why are the disks for creating OSDs selected "randomly" and not "first things first"? Below is the disk layout for the four nodes (each node has the same set of disks, vdb-vdj; vdb and vdc have 100G, the remaining disks have 1T, and /dev/vdb on all nodes behaves as an SSD disk).

Why are the OSDs not sorted? For example, on the first node, why is osd.0 not on vdc, osd.1 on vdd and so on?

A probably related question: why aren't the "unused" disks consistently the last disks on each node? (On node1 it is correctly /dev/vdj, but on node2 it is /dev/vdg, on node3 it is /dev/vde and on node4 it is /dev/vde and /dev/vdg.)

Another question: why does the SSD disk (/dev/vdb) work as a journal sometimes for 5 OSD disks (node1,2,3) and sometimes for 6 disks (node4)?

And why, for example on node1, is the journal for /dev/vdd1 on /dev/vdh and not on /dev/vdb, while /dev/vdb has 100G and /dev/vdh 1T? On node2 it is correct - /dev/vdc is the journal, as it is the smallest of the remaining disks there - but it seems to be just a coincidence, because on node3 it is again not on /dev/vdc.

[node1]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0   20G  0 disk
├─vda1 253:1    0    2G  0 part
└─vda2 253:2    0   18G  0 part /
vdb    253:16   0  100G  0 disk
├─vdb1 253:17   0    5G  0 part
├─vdb2 253:18   0    5G  0 part
├─vdb3 253:19   0    5G  0 part
├─vdb4 253:20   0    5G  0 part
└─vdb5 253:21   0    5G  0 part
vdc    253:32   0  100G  0 disk
└─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-2
vdd    253:48   0    1T  0 disk
└─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-4
vde    253:64   0    1T  0 disk
└─vde1 253:65   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-0
vdf    253:80   0    1T  0 disk
└─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-3
vdg    253:96   0    1T  0 disk
└─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-5
vdh    253:112  0    1T  0 disk
└─vdh1 253:113  0    5G  0 part
vdi    253:128  0    1T  0 disk
└─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-1
vdj    253:144  0    1T  0 disk

[node1]# ceph-disk list
/dev/vda :
 /dev/vda1 other, swap
 /dev/vda2 other, xfs, mounted on /
/dev/vdb :
 /dev/vdb3 ceph journal, for /dev/vdc1
 /dev/vdb1 ceph journal, for /dev/vde1
 /dev/vdb4 ceph journal, for /dev/vdf1
 /dev/vdb5 ceph journal, for /dev/vdg1
 /dev/vdb2 ceph journal, for /dev/vdi1
/dev/vdc :
 /dev/vdc1 ceph data, active, cluster TestClusterA, osd.2, journal /dev/vdb3
/dev/vdd :
 /dev/vdd1 ceph data, active, cluster TestClusterA, osd.4, journal /dev/vdh1
/dev/vde :
 /dev/vde1 ceph data, active, cluster TestClusterA, osd.0, journal /dev/vdb1
/dev/vdf :
 /dev/vdf1 ceph data, active, cluster TestClusterA, osd.3, journal /dev/vdb4
/dev/vdg :
 /dev/vdg1 ceph data, active, cluster TestClusterA, osd.5, journal /dev/vdb5
/dev/vdh :
 /dev/vdh1 ceph journal, for /dev/vdd1
/dev/vdi :
 /dev/vdi1 ceph data, active, cluster TestClusterA, osd.1, journal /dev/vdb2
/dev/vdj other, unknown

-----------------------------------------------

[node2]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0   20G  0 disk
├─vda1 253:1    0    2G  0 part
└─vda2 253:2    0   18G  0 part /
vdb    253:16   0  100G  0 disk
├─vdb1 253:17   0    5G  0 part
├─vdb2 253:18   0    5G  0 part
├─vdb3 253:19   0    5G  0 part
├─vdb4 253:20   0    5G  0 part
└─vdb5 253:21   0    5G  0 part
vdc    253:32   0  100G  0 disk
└─vdc1 253:33   0    5G  0 part
vdd    253:48   0    1T  0 disk
└─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-11
vde    253:64   0    1T  0 disk
└─vde1 253:65   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-8
vdf    253:80   0    1T  0 disk
└─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-10
vdg    253:96   0    1T  0 disk
vdh    253:112  0    1T  0 disk
└─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-7
vdi    253:128  0    1T  0 disk
└─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-9
vdj    253:144  0    1T  0 disk
└─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-6

[node2]# ceph-disk list
/dev/vda :
 /dev/vda1 other, swap
 /dev/vda2 other, xfs, mounted on /
/dev/vdb :
 /dev/vdb5 ceph journal, for /dev/vdd1
 /dev/vdb3 ceph journal, for /dev/vde1
 /dev/vdb4 ceph journal, for /dev/vdf1
 /dev/vdb2 ceph journal, for /dev/vdh1
 /dev/vdb1 ceph journal, for /dev/vdj1
/dev/vdc :
 /dev/vdc1 ceph journal, for /dev/vdi1
/dev/vdd :
 /dev/vdd1 ceph data, active, cluster TestClusterA, osd.11, journal /dev/vdb5
/dev/vde :
 /dev/vde1 ceph data, active, cluster TestClusterA, osd.8, journal /dev/vdb3
/dev/vdf :
 /dev/vdf1 ceph data, active, cluster TestClusterA, osd.10, journal /dev/vdb4
/dev/vdg other, unknown
/dev/vdh :
 /dev/vdh1 ceph data, active, cluster TestClusterA, osd.7, journal /dev/vdb2
/dev/vdi :
 /dev/vdi1 ceph data, active, cluster TestClusterA, osd.9, journal /dev/vdc1
/dev/vdj :
 /dev/vdj1 ceph data, active, cluster TestClusterA, osd.6, journal /dev/vdb1

-----------------------------------------------

[node3]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0   20G  0 disk
├─vda1 253:1    0    2G  0 part
└─vda2 253:2    0   18G  0 part /
vdb    253:16   0  100G  0 disk
├─vdb1 253:17   0    5G  0 part
├─vdb2 253:18   0    5G  0 part
├─vdb3 253:19   0    5G  0 part
├─vdb4 253:20   0    5G  0 part
└─vdb5 253:21   0    5G  0 part
vdc    253:32   0  100G  0 disk
└─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-17
vdd    253:48   0    1T  0 disk
└─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-16
vde    253:64   0    1T  0 disk
vdf    253:80   0    1T  0 disk
└─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-15
vdg    253:96   0    1T  0 disk
└─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-12
vdh    253:112  0    1T  0 disk
└─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-14
vdi    253:128  0    1T  0 disk
└─vdi1 253:129  0    5G  0 part
vdj    253:144  0    1T  0 disk
└─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-13

[node3]# ceph-disk list
/dev/vda :
 /dev/vda1 other, swap
 /dev/vda2 other, xfs, mounted on /
/dev/vdb :
 /dev/vdb5 ceph journal, for /dev/vdc1
 /dev/vdb4 ceph journal, for /dev/vdd1
 /dev/vdb3 ceph journal, for /dev/vdf1
 /dev/vdb2 ceph journal, for /dev/vdh1
 /dev/vdb1 ceph journal, for /dev/vdj1
/dev/vdc :
 /dev/vdc1 ceph data, active, cluster TestClusterA, osd.17, journal /dev/vdb5
/dev/vdd :
 /dev/vdd1 ceph data, active, cluster TestClusterA, osd.16, journal /dev/vdb4
/dev/vde other, unknown
/dev/vdf :
 /dev/vdf1 ceph data, active, cluster TestClusterA, osd.15, journal /dev/vdb3
/dev/vdg :
 /dev/vdg1 ceph data, active, cluster TestClusterA, osd.12, journal /dev/vdi1
/dev/vdh :
 /dev/vdh1 ceph data, active, cluster TestClusterA, osd.14, journal /dev/vdb2
/dev/vdi :
 /dev/vdi1 ceph journal, for /dev/vdg1
/dev/vdj :
 /dev/vdj1 ceph data, active, cluster TestClusterA, osd.13, journal /dev/vdb1

-----------------------------------------------

[node4]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    253:0    0   20G  0 disk
├─vda1 253:1    0    2G  0 part
└─vda2 253:2    0   18G  0 part /
vdb    253:16   0  100G  0 disk
├─vdb1 253:17   0    5G  0 part
├─vdb2 253:18   0    5G  0 part
├─vdb3 253:19   0    5G  0 part
├─vdb4 253:20   0    5G  0 part
├─vdb5 253:21   0    5G  0 part
└─vdb6 253:22   0    5G  0 part
vdc    253:32   0  100G  0 disk
└─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-21
vdd    253:48   0    1T  0 disk
└─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-22
vde    253:64   0    1T  0 disk
vdf    253:80   0    1T  0 disk
└─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-19
vdg    253:96   0    1T  0 disk
vdh    253:112  0    1T  0 disk
└─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-18
vdi    253:128  0    1T  0 disk
└─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-20
vdj    253:144  0    1T  0 disk
└─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-23

[node4]# ceph-disk list
/dev/vda :
 /dev/vda1 other, swap
 /dev/vda2 other, xfs, mounted on /
/dev/vdb :
 /dev/vdb4 ceph journal, for /dev/vdc1
 /dev/vdb5 ceph journal, for /dev/vdd1
 /dev/vdb2 ceph journal, for /dev/vdf1
 /dev/vdb1 ceph journal, for /dev/vdh1
 /dev/vdb3 ceph journal, for /dev/vdi1
 /dev/vdb6 ceph journal, for /dev/vdj1
/dev/vdc :
 /dev/vdc1 ceph data, active, cluster TestClusterA, osd.21, journal /dev/vdb4
/dev/vdd :
 /dev/vdd1 ceph data, active, cluster TestClusterA, osd.22, journal /dev/vdb5
/dev/vde other, unknown
/dev/vdf :
 /dev/vdf1 ceph data, active, cluster TestClusterA, osd.19, journal /dev/vdb2
/dev/vdg other, unknown
/dev/vdh :
 /dev/vdh1 ceph data, active, cluster TestClusterA, osd.18, journal /dev/vdb1
/dev/vdi :
 /dev/vdi1 ceph data, active, cluster TestClusterA, osd.20, journal /dev/vdb3
/dev/vdj :
 /dev/vdj1 ceph data, active, cluster TestClusterA, osd.23, journal /dev/vdb6
Daniel,

The logic goes as below:

1. Bigger disks are always given preference to be used as data disks and smaller ones as journal disks.

2. Mapping of data disks to their journal disks happens on a per-node basis, so a disk cannot have its journal disk on another node.

3. In case all the disks are rotational, only one data disk can use another disk as its journal disk.

4. If SSDs are available, SSDs are given preference to be used as journal disks (even if they are bigger in size), as SSDs can be used as the journal disk for multiple data disks - up to 6 at the moment.

6. If there is a mix of rotational and SSD disks on the node, all the rotational disks first start using SSDs as their journals. If the SSDs are exhausted as journal disks, the rotational disks use journal disks among themselves and the smaller ones are selected as journals.

7. If all rotational disks are exhausted and SSDs are still left, the SSDs use journals among themselves, and within them the smaller ones are used as journals.

8. To achieve this we always sort the rotational disks and the SSDs in descending order of size and then apply the above logic to arrive at the mapping of data disks to journal disks.

Hope this clarifies...
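To illustrate point 8: a small sketch of how the per-node classification and sorting could be done, reading the rotational flag and the disk size from sysfs (whether the actual rhscon-ceph code gathers the information this way is an assumption; /sys/block/<dev>/size is in 512-byte sectors). The tuples use the same (name, size_gb, is_ssd) form as the mapping sketch earlier in this report.

# Split a node's disks into SSDs and rotational disks and sort each group
# by size, descending - the starting order for the mapping described above.
def classify_and_sort(devices):
    ssds, rotational = [], []
    for dev in devices:
        base = "/sys/block/%s/" % dev
        with open(base + "queue/rotational") as f:
            is_ssd = f.read().strip() == "0"
        with open(base + "size") as f:
            size_gb = int(f.read()) * 512 / 2**30
        (ssds if is_ssd else rotational).append((dev, size_gb, is_ssd))
    ssds.sort(key=lambda d: d[1], reverse=True)
    rotational.sort(key=lambda d: d[1], reverse=True)
    return ssds, rotational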
Shubhendu,

I think I understand the process quite well from your comments (4 and 8). In the previous comment I tried to point out things which seem not completely in keeping with the description.

(In reply to Shubhendu Tripathi from comment #8)
> The logic goes as below:
> 
> 1. Bigger disks are always given preference to be used as data disks and
> smaller ones as journal disks.

Why, for example on node1, is the journal for /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)? (Sorry, I made a typo in the question in the previous comment and wrote vdb instead of vdc.)

> 2. Mapping of data disks to their journal disks happens on a per-node
> basis, so a disk cannot have its journal disk on another node.

Sure, this I perfectly understand.

> 3. In case all the disks are rotational, only one data disk can use another
> disk as its journal disk.

This is also clear to me.

> 4. If SSDs are available, SSDs are given preference to be used as journal
> disks (even if they are bigger in size), as SSDs can be used as the journal
> disk for multiple data disks - up to 6 at the moment.

Why does the SSD disk (/dev/vdb) work as a journal sometimes for 5 OSD disks (node1,2,3) and sometimes for 6 disks (node4)?

> 6. If there is a mix of rotational and SSD disks on the node, all the
> rotational disks first start using SSDs as their journals. If the SSDs are
> exhausted as journal disks, the rotational disks use journal disks among
> themselves and the smaller ones are selected as journals.

Same question as for point 1: why, for example on node1, is the journal for /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)?

> 7. If all rotational disks are exhausted and SSDs are still left, the SSDs
> use journals among themselves, and within them the smaller ones are used as
> journals.
> 
> 8. To achieve this we always sort the rotational disks and the SSDs in
> descending order of size and then apply the above logic to arrive at the
> mapping of data disks to journal disks.

All the disks from vdd to vdj have the same size (1T); in that case I think they should be used in alphabetical order, because the current "random/unsorted" state is very confusing for the administrator.
(In reply to Daniel Horák from comment #9)
> (In reply to Shubhendu Tripathi from comment #8)
> > 1. Bigger disks are always given preference to be used as data disks and
> > smaller ones as journal disks.
> 
> Why, for example on node1, is the journal for /dev/vdd1 on /dev/vdh (size
> 1TB) and not on /dev/vdc (size 100GB)? (Sorry, I made a typo in the
> question in the previous comment and wrote vdb instead of vdc.)

It depends on the type of disk. If it is an SSD, it is always used as a journal disk, whatever its size. Rotational disks always try to use SSDs (if any) as journals first. If there are no SSDs at all, the rotational disks use the smaller ones as journals, taken from the list of disks sorted in descending order of size.

> > 4. If SSDs are available, SSDs are given preference to be used as journal
> > disks (even if they are bigger in size), as SSDs can be used as the
> > journal disk for multiple data disks - up to 6 at the moment.
> 
> Why does the SSD disk (/dev/vdb) work as a journal sometimes for 5 OSD
> disks (node1,2,3) and sometimes for 6 disks (node4)?

Currently an SSD can be used as the journal for a maximum of 6 disks. There is a patch to make the default 4. If space is available and rotational disks want to use an SSD as their journal, a maximum of 4 will be able to use it; even if space is left after that, it won't be utilized.

So effectively it depends on the available size of the SSD and on how many rotational disks are available for data, to figure out how many journals will be placed on an SSD.

> > 8. To achieve this we always sort the rotational disks and the SSDs in
> > descending order of size and then apply the above logic to arrive at the
> > mapping of data disks to journal disks.
> 
> All the disks from vdd to vdj have the same size (1T); in that case I think
> they should be used in alphabetical order, because the current
> "random/unsorted" state is very confusing for the administrator.

If all the disks have the same size, we do not use alphabetical order to do the mapping. All we do is use one disk as data and another as its journal; whatever order comes out of the sorted list, we just follow it. Using the alphabetical names of the disks is, I think, not a good idea. As long as the bigger disks are used for data and the smaller ones for journals, it serves the purpose well.
(In reply to Shubhendu Tripathi from comment #10)
> It depends on the type of disk. If it is an SSD, it is always used as a
> journal disk, whatever its size. Rotational disks always try to use SSDs
> (if any) as journals first. If there are no SSDs at all, the rotational
> disks use the smaller ones as journals, taken from the list of disks sorted
> in descending order of size.

Only /dev/vdb is an "SSD" disk (to be precise, it behaves as an SSD disk), so neither vdh nor vdc is an SSD.

> Currently an SSD can be used as the journal for a maximum of 6 disks. There
> is a patch to make the default 4. If space is available and rotational
> disks want to use an SSD as their journal, a maximum of 4 will be able to
> use it; even if space is left after that, it won't be utilized.
> 
> So effectively it depends on the available size of the SSD and on how many
> rotational disks are available for data, to figure out how many journals
> will be placed on an SSD.

As you can see in the lsblk output in comment 7, the vdb device on all nodes has 100GB, so there is enough space for 6*5GB.

> If all the disks have the same size, we do not use alphabetical order to do
> the mapping. All we do is use one disk as data and another as its journal;
> whatever order comes out of the sorted list, we just follow it. Using the
> alphabetical names of the disks is, I think, not a good idea. As long as
> the bigger disks are used for data and the smaller ones for journals, it
> serves the purpose well.

I understand the sorting according to size. From my point of view it would be helpful, and would look better, to also sort disks of (nearly) the same size alphabetically. But it is just a cosmetic (nice-to-have) issue.
There is a bug around SSDs being used as journals. Sent patch https://review.gerrithub.io/#/c/278720/ to resolve this.
With the patches
https://review.gerrithub.io/#/c/278720/
https://review.gerrithub.io/#/c/280447/
the journal mapping logic works as expected. Moving to MODIFIED state.
Tested on:

USM Server (RHEL 7.2):
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.38-1.el7scon.x86_64
rhscon-core-0.0.38-1.el7scon.x86_64
rhscon-core-selinux-0.0.38-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch

Ceph MON (RHEL 7.2):
calamari-server-1.4.7-1.el7cp.x86_64
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-mon-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.38-1.el7scon.noarch

Ceph OSD (RHEL 7.2):
ceph-base-10.2.2-32.el7cp.x86_64
ceph-common-10.2.2-32.el7cp.x86_64
ceph-osd-10.2.2-32.el7cp.x86_64
ceph-selinux-10.2.2-32.el7cp.x86_64
libcephfs1-10.2.2-32.el7cp.x86_64
python-cephfs-10.2.2-32.el7cp.x86_64
rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.38-1.el7scon.noarch

The algorithm works as described in this bug.

I've created a new RFE bug for more predictably distributed OSDs and journals across the available disks:
Bug 1362431 - [RFE] distribute OSDs and journal more predictably across the available disks

>> VERIFIED