Bug 1334344 - OSDs and journal "randomly" and absurdly distributed across available disks
Summary: OSDs and journal "randomly" and absurdly distributed across available disks
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: Ceph
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 2
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-09 12:21 UTC by Daniel Horák
Modified: 2018-11-19 05:30 UTC
CC List: 3 users

Fixed In Version: rhscon-core-0.0.34-1.el7scon.x86_64 rhscon-ceph-0.0.33-1.el7scon.x86_64 rhscon-ui-0.0.47-1.el7scon.noarch
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 05:30:03 UTC
Embargoed:


Attachments
POST request data (3.21 KB, text/plain), 2016-05-09 12:24 UTC, Daniel Horák


Links
Red Hat Bugzilla 1362431 (unspecified, CLOSED): [RFE] distribute OSDs and journal more predictably across the available disks (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1362431

Description Daniel Horák 2016-05-09 12:21:05 UTC
Description of problem:
  I've created a Ceph cluster from 4 OSD nodes; each node has 8 spare disks (/dev/vd{b..i} - vdb and vdc are 100GB virtual disks, vdd..vdi are 1TB virtual disks).
  Apart from the fact that only a small part of the expected OSDs were properly created (Bug 1333399, comment 12), there are other issues: "ceph journal" and "ceph data" partitions are randomly distributed across the available disks, there is only one "ceph journal" partition per disk, and there is no way to affect the distribution of "ceph journal" and "ceph data" partitions across the disks.

Version-Release number of selected component (if applicable):
  ceph-ansible-1.0.5-5.el7scon.noarch
  ceph-base-10.2.0-1.el7cp.x86_64
  ceph-common-10.2.0-1.el7cp.x86_64
  ceph-installer-1.0.6-1.el7scon.noarch
  ceph-mon-10.2.0-1.el7cp.x86_64
  ceph-osd-10.2.0-1.el7cp.x86_64
  ceph-selinux-10.2.0-1.el7cp.x86_64
  rhscon-agent-0.0.6-1.el7scon.noarch
  rhscon-ceph-0.0.13-1.el7scon.x86_64
  rhscon-core-0.0.16-1.el7scon.x86_64
  rhscon-ui-0.0.29-1.el7scon.noarch

How reproducible:
  100%

Steps to Reproduce:
1. Prepare a few nodes for a Ceph cluster, with a higher number of disks (e.g. 8) on each.
2. Create a cluster according to the documentation.
3. Check the created cluster and the disk usage on each node.

Actual results:
a) "ceph journal" and "ceph data" partitions are randomly and absurdly distributed across the available disks

b) "ceph journal" for each OSD consume whole disk (one disk contains only one "ceph journal" partition even when there is plenty of available space)

c) The administrator is not able to choose which disk should be used for journal and which for OSD (and also which should be left untouched).
  For example, in my deployment each node has 2 smaller disks - 100GB each - intended for journals, and 6 big disks - 1TB each - intended for ceph data.
  

Expected results:
a) "ceph journal" and "ceph data" partitions should be distributed more logically (particular implementation depends on point c))

b) Journal partitions for more OSDs should be created on one disk.

c) Administrator would like to be able to affect this process - select or prioritize disks designed for journal (e.g. SSD disks), for data and maybe also skip some disks and leave them untouched.

Additional info:
# lsblk 
 NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
 vda    253:0    0   20G  0 disk 
 ├─vda1 253:1    0    2G  0 part 
 └─vda2 253:2    0   18G  0 part /
 vdb    253:16   0  100G  0 disk 
 └─vdb1 253:17   0    4G  0 part 
 vdc    253:32   0  100G  0 disk 
 └─vdc1 253:33   0    4G  0 part 
 vdd    253:48   0    1T  0 disk 
 └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/ceph-1
 vde    253:64   0    1T  0 disk 
 vdf    253:80   0    1T  0 disk 
 └─vdf1 253:81   0    4G  0 part 
 vdg    253:96   0    1T  0 disk 
 └─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/ceph-0
 vdh    253:112  0    1T  0 disk 
 └─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/ceph-2
 vdi    253:128  0    1T  0 disk 
 └─vdi1 253:129  0    4G  0 part 

# blkid 
  /dev/block/253:2: UUID="a8886be3-e35c-49fd-878a-c5bfd22aa9fe" TYPE="ext4" 
  /dev/block/253:1: UUID="4424075e-8873-41c3-90d0-1285b6abdb52" TYPE="swap" 
  /dev/vdd1: UUID="ef0796ff-d443-4571-970c-d478526eee83" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="e2da038c-864f-4180-b914-55ccb8e601e8" 
  /dev/vdg1: UUID="16d0590b-dea8-4fd0-8f39-0ed6f81be97a" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="5d2c1ef0-0d3e-4d68-a1e3-c7491d44eeb3" 
  /dev/vdh1: UUID="109ec020-6b64-4b87-b215-8fbefc416e80" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="5375ed41-4d1e-42ff-a5d7-1dc3a11e9c21" 
  /dev/vdb1: PARTLABEL="ceph journal" PARTUUID="080e098e-7963-4691-a1e0-29c6e01c1166" 
  /dev/vdc1: PARTLABEL="ceph journal" PARTUUID="742b83f4-85b5-48bd-a775-6dba9819c1ca" 
  /dev/vde: PTTYPE="gpt" 
  /dev/vdf1: PARTLABEL="ceph journal" PARTUUID="3334f85d-df37-4711-841e-89bc68aebe40" 
  /dev/vdi1: PARTLABEL="ceph journal" PARTUUID="e196bd0b-4492-414d-9be7-196cd8e299ce"

Comment 2 Daniel Horák 2016-05-09 12:24:34 UTC
Created attachment 1155276 [details]
POST request data

Comment 3 Daniel Horák 2016-05-09 12:25:55 UTC
(In reply to Daniel Horák from comment #2)
> Created attachment 1155276 [details]
> POST request data

Create cluster POST request.

Comment 4 Shubhendu Tripathi 2016-05-10 04:11:50 UTC
(In reply to Daniel Horák from comment #0)
> Description of problem:
>   I've created ceph cluster from 4 OSD nodes, each node have 8 spare disks
> (/dev/vd{b..i} - vdb and vdc have virtually 100GB, vdd..vdi have virtually
> 1TB).
>   Apart from the fact, that only small part of the expected OSDs were
> properly created (Bug 1333399 comment 12), there are another issues:  "ceph
> journal" and "ceph data" partitions randomly distributed across the
> available disks, only one "ceph journal" partition per disk, no possibility
> to affect the distribution of "ceph journal" and "ceph data" partitions
> across the disks.
> 
> Version-Release number of selected component (if applicable):
>   ceph-ansible-1.0.5-5.el7scon.noarch
>   ceph-base-10.2.0-1.el7cp.x86_64
>   ceph-common-10.2.0-1.el7cp.x86_64
>   ceph-installer-1.0.6-1.el7scon.noarch
>   ceph-mon-10.2.0-1.el7cp.x86_64
>   ceph-osd-10.2.0-1.el7cp.x86_64
>   ceph-selinux-10.2.0-1.el7cp.x86_64
>   rhscon-agent-0.0.6-1.el7scon.noarch
>   rhscon-ceph-0.0.13-1.el7scon.x86_64
>   rhscon-core-0.0.16-1.el7scon.x86_64
>   rhscon-ui-0.0.29-1.el7scon.noarch
> 
> How reproducible:
>   100%s
> 
> Steps to Reproduce:
> 1. Prepare few nodes for ceph cluster with higher number of disks (e.g. 8)
> for each.
> 2. Create Cluster accordingly to the documentation.
> 3. Check the created cluster, check disk use on each node.
> 
> Actual results:
> a) "ceph journal" and "ceph data" partitions are randomly and absurdly
> distributed across the available disks

The selection of journals is not random. The algorithm is as follows (a minimal sketch of the logic in Go follows the list):

1. If all the disks are rotational, a disk can serve as the journal for only one other disk. The set of disks is sorted in descending order of size and the bigger disks start using the smaller ones as journal disks, but one disk uses only one other, smaller disk as its journal.
2. If some disks are SSDs and some are rotational, the SSDs are given preference as journal disks. One SSD can serve as the journal for a maximum of 6 disks (as long as space is available). In this case we start mapping disks to SSDs to be used as journals and continue until either the space is exhausted or the number of mapped disks reaches 6. Once an SSD has reached this limit, we start using the next SSD for journal mapping.
3. If disks are still left after the above logic, we do the mapping among the leftover disks. If only rotational disks are left, the mapping happens as per step 1.
4. If only SSDs are left, the mapping happens as per step 2. One SSD can be used as the journal for a maximum of 6 disks.
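
A minimal sketch of this mapping logic in Go (this is not the actual rhscon-ceph implementation; the Disk type, the 5 GB journal size and the helper names are assumptions made only to illustrate steps 1-3, and the SSD-only case from step 4 is omitted):

  package main

  import (
      "fmt"
      "sort"
  )

  type Disk struct {
      Name       string
      SizeGB     int
      Rotational bool
  }

  const (
      maxJournalsPerSSD = 6 // per-SSD limit mentioned in step 2
      journalSizeGB     = 5 // assumed journal partition size
  )

  // mapJournals pairs each data disk with a journal disk on the same node.
  func mapJournals(disks []Disk) map[string]string {
      mapping := map[string]string{}

      var ssds, hdds []Disk
      for _, d := range disks {
          if d.Rotational {
              hdds = append(hdds, d)
          } else {
              ssds = append(ssds, d)
          }
      }
      bySizeDesc := func(s []Disk) {
          sort.Slice(s, func(i, j int) bool { return s[i].SizeGB > s[j].SizeGB })
      }
      bySizeDesc(ssds)
      bySizeDesc(hdds)

      // Step 2: rotational data disks put their journals on SSDs, at most
      // maxJournalsPerSSD per SSD and only while the SSD has free space.
      count := map[string]int{} // journal partitions placed on each SSD
      space := map[string]int{} // free space left on each SSD
      for _, s := range ssds {
          space[s.Name] = s.SizeGB
      }
      var leftover []Disk
      i := 0
      for _, d := range hdds {
          for i < len(ssds) &&
              (count[ssds[i].Name] >= maxJournalsPerSSD || space[ssds[i].Name] < journalSizeGB) {
              i++ // this SSD is full, move on to the next one
          }
          if i < len(ssds) {
              s := ssds[i].Name
              mapping[d.Name] = s
              count[s]++
              space[s] -= journalSizeGB
          } else {
              leftover = append(leftover, d) // no SSD left for this data disk
          }
      }

      // Steps 1 and 3: leftover rotational disks pair among themselves; the
      // bigger disk holds the data and the smaller one is its dedicated journal.
      for len(leftover) >= 2 {
          data, journal := leftover[0], leftover[len(leftover)-1]
          mapping[data.Name] = journal.Name
          leftover = leftover[1 : len(leftover)-1]
      }
      return mapping
  }

  func main() {
      // All-rotational layout similar to the one in this report.
      disks := []Disk{
          {"vdb", 100, true}, {"vdc", 100, true},
          {"vdd", 1024, true}, {"vde", 1024, true}, {"vdf", 1024, true},
          {"vdg", 1024, true}, {"vdh", 1024, true}, {"vdi", 1024, true},
      }
      fmt.Println(mapJournals(disks)) // data disk -> journal disk
  }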

> 
> b) "ceph journal" for each OSD consume whole disk (one disk contains only
> one "ceph journal" partition even when there is plenty of available space)
> 

In this case it looks like all your disks are rotational, so one disk is used as the journal for another disk. A rotational disk cannot be used as the journal for more than one disk.

> c) Administrator is not able to chose which disk should be used for journal
> and which for OSD (and also which should be left untouched).
>   For example in my deployment, on each node I have 2 smaller disks - 100GB
> each - designed for journal and 6 big disks - 1TB each - designed for ceph
> data.
>   

This was discussed with UX during the design phase, and it was decided that journal mapping should be done automatically and intelligently, without the user needing to intervene. Currently there is no way to select journal disks on a per-disk basis.

> 
> Expected results:
> a) "ceph journal" and "ceph data" partitions should be distributed more
> logically (particular implementation depends on point c))
> 
> b) Journal partitions for more OSDs should be created on one disk.
> 
> c) Administrator would like to be able to affect this process - select or
> prioritize disks designed for journal (e.g. SSD disks), for data and maybe
> also skip some disks and leave them untouched.
> 
> Additional info:
> # lsblk 
>  NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
>  vda    253:0    0   20G  0 disk 
>  ├─vda1 253:1    0    2G  0 part 
>  └─vda2 253:2    0   18G  0 part /
>  vdb    253:16   0  100G  0 disk 
>  └─vdb1 253:17   0    4G  0 part 
>  vdc    253:32   0  100G  0 disk 
>  └─vdc1 253:33   0    4G  0 part 
>  vdd    253:48   0    1T  0 disk 
>  └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/ceph-1
>  vde    253:64   0    1T  0 disk 
>  vdf    253:80   0    1T  0 disk 
>  └─vdf1 253:81   0    4G  0 part 
>  vdg    253:96   0    1T  0 disk 
>  └─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/ceph-0
>  vdh    253:112  0    1T  0 disk 
>  └─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/ceph-2
>  vdi    253:128  0    1T  0 disk 
>  └─vdi1 253:129  0    4G  0 part 
> 
> # blkid 
>   /dev/block/253:2: UUID="a8886be3-e35c-49fd-878a-c5bfd22aa9fe" TYPE="ext4" 
>   /dev/block/253:1: UUID="4424075e-8873-41c3-90d0-1285b6abdb52" TYPE="swap" 
>   /dev/vdd1: UUID="ef0796ff-d443-4571-970c-d478526eee83" TYPE="xfs"
> PARTLABEL="ceph data" PARTUUID="e2da038c-864f-4180-b914-55ccb8e601e8" 
>   /dev/vdg1: UUID="16d0590b-dea8-4fd0-8f39-0ed6f81be97a" TYPE="xfs"
> PARTLABEL="ceph data" PARTUUID="5d2c1ef0-0d3e-4d68-a1e3-c7491d44eeb3" 
>   /dev/vdh1: UUID="109ec020-6b64-4b87-b215-8fbefc416e80" TYPE="xfs"
> PARTLABEL="ceph data" PARTUUID="5375ed41-4d1e-42ff-a5d7-1dc3a11e9c21" 
>   /dev/vdb1: PARTLABEL="ceph journal"
> PARTUUID="080e098e-7963-4691-a1e0-29c6e01c1166" 
>   /dev/vdc1: PARTLABEL="ceph journal"
> PARTUUID="742b83f4-85b5-48bd-a775-6dba9819c1ca" 
>   /dev/vde: PTTYPE="gpt" 
>   /dev/vdf1: PARTLABEL="ceph journal"
> PARTUUID="3334f85d-df37-4711-841e-89bc68aebe40" 
>   /dev/vdi1: PARTLABEL="ceph journal"
> PARTUUID="e196bd0b-4492-414d-9be7-196cd8e299ce"

Comment 5 Daniel Horák 2016-05-10 11:46:53 UTC
Shubhendu, thanks for the clarification!
I was slightly confused because the assignment seemed different on each node, but that was probably because of other issues (Bug 1333399).

I will try to simulate it on nodes with SSD and check the behaviour.

Also, I think that this process should be described in the documentation.

Comment 6 Shubhendu Tripathi 2016-05-10 12:01:49 UTC
Agreed. This should be added to the documentation.

Comment 7 Daniel Horák 2016-05-31 09:01:54 UTC
I have another question related to this: why are the disks for creating OSDs selected "randomly" rather than in order?

The following example shows the disk layout for four nodes (each node has the same set of disks, vdb-vdj; vdb and vdc have 100G, the remaining disks have 1T, and /dev/vdb on all nodes behaves as an SSD disk).

Why are the OSDs not sorted? For example, on the first node, why is osd.0 not on vdc, osd.1 on vdd, and so on?
And a probably related question: why are the "unused" disks not consistently the last disks on each node? (On node1 it is correctly /dev/vdj, but on node2 it is /dev/vdg, on node3 it is /dev/vde, and on node4 it is /dev/vde and /dev/vdg.)

And another question: why does the SSD disk (/dev/vdb) serve as the journal sometimes for 5 OSD disks (node1, 2, 3) and sometimes for 6 disks (node4)?

And why, for example, on node1 is the journal for /dev/vdd1 on /dev/vdh and not on /dev/vdb, while /dev/vdb has 100G and /dev/vdh has 1T?
On node2 it is correct - /dev/vdc is the journal, as it is the smallest of the remaining disks there - but it seems to be just a coincidence, because on node3 it is again not on /dev/vdc.


  [node1]# lsblk 
    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    vda    253:0    0   20G  0 disk 
    ├─vda1 253:1    0    2G  0 part 
    └─vda2 253:2    0   18G  0 part /
    vdb    253:16   0  100G  0 disk 
    ├─vdb1 253:17   0    5G  0 part 
    ├─vdb2 253:18   0    5G  0 part 
    ├─vdb3 253:19   0    5G  0 part 
    ├─vdb4 253:20   0    5G  0 part 
    └─vdb5 253:21   0    5G  0 part 
    vdc    253:32   0  100G  0 disk 
    └─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-2
    vdd    253:48   0    1T  0 disk 
    └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-4
    vde    253:64   0    1T  0 disk 
    └─vde1 253:65   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-0
    vdf    253:80   0    1T  0 disk 
    └─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-3
    vdg    253:96   0    1T  0 disk 
    └─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-5
    vdh    253:112  0    1T  0 disk 
    └─vdh1 253:113  0    5G  0 part 
    vdi    253:128  0    1T  0 disk 
    └─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-1
    vdj    253:144  0    1T  0 disk 

  [node1]# ceph-disk list
    /dev/vda :
     /dev/vda1 other, swap
     /dev/vda2 other, xfs, mounted on /
    /dev/vdb :
     /dev/vdb3 ceph journal, for /dev/vdc1
     /dev/vdb1 ceph journal, for /dev/vde1
     /dev/vdb4 ceph journal, for /dev/vdf1
     /dev/vdb5 ceph journal, for /dev/vdg1
     /dev/vdb2 ceph journal, for /dev/vdi1
    /dev/vdc :
     /dev/vdc1 ceph data, active, cluster TestClusterA, osd.2, journal /dev/vdb3
    /dev/vdd :
     /dev/vdd1 ceph data, active, cluster TestClusterA, osd.4, journal /dev/vdh1
    /dev/vde :
     /dev/vde1 ceph data, active, cluster TestClusterA, osd.0, journal /dev/vdb1
    /dev/vdf :
     /dev/vdf1 ceph data, active, cluster TestClusterA, osd.3, journal /dev/vdb4
    /dev/vdg :
     /dev/vdg1 ceph data, active, cluster TestClusterA, osd.5, journal /dev/vdb5
    /dev/vdh :
     /dev/vdh1 ceph journal, for /dev/vdd1
    /dev/vdi :
     /dev/vdi1 ceph data, active, cluster TestClusterA, osd.1, journal /dev/vdb2
    /dev/vdj other, unknown
  
  -----------------------------------------------

  [node2]# lsblk 
    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    vda    253:0    0   20G  0 disk 
    ├─vda1 253:1    0    2G  0 part 
    └─vda2 253:2    0   18G  0 part /
    vdb    253:16   0  100G  0 disk 
    ├─vdb1 253:17   0    5G  0 part 
    ├─vdb2 253:18   0    5G  0 part 
    ├─vdb3 253:19   0    5G  0 part 
    ├─vdb4 253:20   0    5G  0 part 
    └─vdb5 253:21   0    5G  0 part 
    vdc    253:32   0  100G  0 disk 
    └─vdc1 253:33   0    5G  0 part 
    vdd    253:48   0    1T  0 disk 
    └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-11
    vde    253:64   0    1T  0 disk 
    └─vde1 253:65   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-8
    vdf    253:80   0    1T  0 disk 
    └─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-10
    vdg    253:96   0    1T  0 disk 
    vdh    253:112  0    1T  0 disk 
    └─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-7
    vdi    253:128  0    1T  0 disk 
    └─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-9
    vdj    253:144  0    1T  0 disk 
    └─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-6

  [node2]# ceph-disk list
    /dev/vda :
     /dev/vda1 other, swap
     /dev/vda2 other, xfs, mounted on /
    /dev/vdb :
     /dev/vdb5 ceph journal, for /dev/vdd1
     /dev/vdb3 ceph journal, for /dev/vde1
     /dev/vdb4 ceph journal, for /dev/vdf1
     /dev/vdb2 ceph journal, for /dev/vdh1
     /dev/vdb1 ceph journal, for /dev/vdj1
    /dev/vdc :
     /dev/vdc1 ceph journal, for /dev/vdi1
    /dev/vdd :
     /dev/vdd1 ceph data, active, cluster TestClusterA, osd.11, journal /dev/vdb5
    /dev/vde :
     /dev/vde1 ceph data, active, cluster TestClusterA, osd.8, journal /dev/vdb3
    /dev/vdf :
     /dev/vdf1 ceph data, active, cluster TestClusterA, osd.10, journal /dev/vdb4
    /dev/vdg other, unknown
    /dev/vdh :
     /dev/vdh1 ceph data, active, cluster TestClusterA, osd.7, journal /dev/vdb2
    /dev/vdi :
     /dev/vdi1 ceph data, active, cluster TestClusterA, osd.9, journal /dev/vdc1
    /dev/vdj :
     /dev/vdj1 ceph data, active, cluster TestClusterA, osd.6, journal /dev/vdb1

  -----------------------------------------------

  [node3]# lsblk
    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    vda    253:0    0   20G  0 disk 
    ├─vda1 253:1    0    2G  0 part 
    └─vda2 253:2    0   18G  0 part /
    vdb    253:16   0  100G  0 disk 
    ├─vdb1 253:17   0    5G  0 part 
    ├─vdb2 253:18   0    5G  0 part 
    ├─vdb3 253:19   0    5G  0 part 
    ├─vdb4 253:20   0    5G  0 part 
    └─vdb5 253:21   0    5G  0 part 
    vdc    253:32   0  100G  0 disk 
    └─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-17
    vdd    253:48   0    1T  0 disk 
    └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-16
    vde    253:64   0    1T  0 disk 
    vdf    253:80   0    1T  0 disk 
    └─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-15
    vdg    253:96   0    1T  0 disk 
    └─vdg1 253:97   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-12
    vdh    253:112  0    1T  0 disk 
    └─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-14
    vdi    253:128  0    1T  0 disk 
    └─vdi1 253:129  0    5G  0 part 
    vdj    253:144  0    1T  0 disk 
    └─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-13

  [node3]# ceph-disk list
    /dev/vda :
     /dev/vda1 other, swap
     /dev/vda2 other, xfs, mounted on /
    /dev/vdb :
     /dev/vdb5 ceph journal, for /dev/vdc1
     /dev/vdb4 ceph journal, for /dev/vdd1
     /dev/vdb3 ceph journal, for /dev/vdf1
     /dev/vdb2 ceph journal, for /dev/vdh1
     /dev/vdb1 ceph journal, for /dev/vdj1
    /dev/vdc :
     /dev/vdc1 ceph data, active, cluster TestClusterA, osd.17, journal /dev/vdb5
    /dev/vdd :
     /dev/vdd1 ceph data, active, cluster TestClusterA, osd.16, journal /dev/vdb4
    /dev/vde other, unknown
    /dev/vdf :
     /dev/vdf1 ceph data, active, cluster TestClusterA, osd.15, journal /dev/vdb3
    /dev/vdg :
     /dev/vdg1 ceph data, active, cluster TestClusterA, osd.12, journal /dev/vdi1
    /dev/vdh :
     /dev/vdh1 ceph data, active, cluster TestClusterA, osd.14, journal /dev/vdb2
    /dev/vdi :
     /dev/vdi1 ceph journal, for /dev/vdg1
    /dev/vdj :
     /dev/vdj1 ceph data, active, cluster TestClusterA, osd.13, journal /dev/vdb1

  -----------------------------------------------

  [node4]# lsblk
    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    vda    253:0    0   20G  0 disk 
    ├─vda1 253:1    0    2G  0 part 
    └─vda2 253:2    0   18G  0 part /
    vdb    253:16   0  100G  0 disk 
    ├─vdb1 253:17   0    5G  0 part 
    ├─vdb2 253:18   0    5G  0 part 
    ├─vdb3 253:19   0    5G  0 part 
    ├─vdb4 253:20   0    5G  0 part 
    ├─vdb5 253:21   0    5G  0 part 
    └─vdb6 253:22   0    5G  0 part 
    vdc    253:32   0  100G  0 disk 
    └─vdc1 253:33   0  100G  0 part /var/lib/ceph/osd/TestClusterA-21
    vdd    253:48   0    1T  0 disk 
    └─vdd1 253:49   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-22
    vde    253:64   0    1T  0 disk 
    vdf    253:80   0    1T  0 disk 
    └─vdf1 253:81   0 1024G  0 part /var/lib/ceph/osd/TestClusterA-19
    vdg    253:96   0    1T  0 disk 
    vdh    253:112  0    1T  0 disk 
    └─vdh1 253:113  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-18
    vdi    253:128  0    1T  0 disk 
    └─vdi1 253:129  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-20
    vdj    253:144  0    1T  0 disk 
    └─vdj1 253:145  0 1024G  0 part /var/lib/ceph/osd/TestClusterA-23
    
  [node4]# ceph-disk list
    /dev/vda :
     /dev/vda1 other, swap
     /dev/vda2 other, xfs, mounted on /
    /dev/vdb :
     /dev/vdb4 ceph journal, for /dev/vdc1
     /dev/vdb5 ceph journal, for /dev/vdd1
     /dev/vdb2 ceph journal, for /dev/vdf1
     /dev/vdb1 ceph journal, for /dev/vdh1
     /dev/vdb3 ceph journal, for /dev/vdi1
     /dev/vdb6 ceph journal, for /dev/vdj1
    /dev/vdc :
     /dev/vdc1 ceph data, active, cluster TestClusterA, osd.21, journal /dev/vdb4
    /dev/vdd :
     /dev/vdd1 ceph data, active, cluster TestClusterA, osd.22, journal /dev/vdb5
    /dev/vde other, unknown
    /dev/vdf :
     /dev/vdf1 ceph data, active, cluster TestClusterA, osd.19, journal /dev/vdb2
    /dev/vdg other, unknown
    /dev/vdh :
     /dev/vdh1 ceph data, active, cluster TestClusterA, osd.18, journal /dev/vdb1
    /dev/vdi :
     /dev/vdi1 ceph data, active, cluster TestClusterA, osd.20, journal /dev/vdb3
    /dev/vdj :
     /dev/vdj1 ceph data, active, cluster TestClusterA, osd.23, journal /dev/vdb6

Comment 8 Shubhendu Tripathi 2016-05-31 09:46:57 UTC
Daniel,

So the logic is as follows (a worked example is given after the list):

1. Bigger disks are always given preference to be used as data disks and smaller ones as journal disks
2. Mapping of data disks to their journal disks happens on a per-node basis, so a disk cannot have its journal disk on another node
3. If all the disks are rotational, only one data disk can use a given disk as its journal disk
4. If SSDs are available, they are given preference as journal disks (even if they are bigger in size, because an SSD can be used as the journal for multiple data disks - up to 6 at the moment)
6. If there is a mix of rotational disks and SSDs on the node, all the rotational disks first start using SSDs as their journals. Once the SSDs are exhausted as journal disks, the rotational disks use journal disks among themselves, and the smaller ones are selected as journals
7. If all the rotational disks are exhausted and SSDs are still left, the SSDs use journals among themselves, and within them the smaller ones are used as journals
8. To achieve this we always sort the rotational disks and SSDs in descending order of size and then apply the above logic to arrive at the mapping of data disks to journal disks
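
As a worked example (using the disk sizes from comment 7; the exact order among disks of the same size is not guaranteed): with one 100G SSD (vdb), one 100G rotational disk (vdc) and seven 1T rotational disks (vdd..vdj) on a node, the rotational disks sort as the seven 1T disks followed by vdc; the SSD vdb is intended to carry the journals for up to 6 of the 1T data disks, and the remaining rotational disks are then paired among themselves, with the bigger disk used for data and the smaller one as its dedicated journal.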

Hope this clarifies...

Comment 9 Daniel Horák 2016-05-31 10:45:18 UTC
Shubhendu,

I think I understand the process from your comments (4 and 8). In the previous comment I tried to point out things which seem not completely consistent with that description.

(In reply to Shubhendu Tripathi from comment #8)
> So the logic goes like as below
> 
> 1. Bigger sized disks are always given preferences to be used as data disk
> and smaller sized ones as journal disks

Why, for example, on node1 is the journal for /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)? (Sorry, I made a typo in the question in the previous comment and wrote vdb instead of vdc.)

> 2. Mapping of data disks to their journal disk happens per node basis. So a
> disk cannot have its journal disk on another node

Sure, this I perfectly understand.

> 3. In case all the disks are rotational in nature, only one data disk can
> use another disk as journal disk

This is also clear for me.

> 4. If SSDs are available, SSD are given preference to be used as journal
> disk (even if they are bigger in size, as SSDs can be used as journal disk
> for multiple data disks, to be specific upto 6 at the moment)

Why does the SSD disk (/dev/vdb) serve as the journal sometimes for 5 OSD disks (node1, 2, 3) and sometimes for 6 disks (node4)?

> 6. If there is a mix of rotational and SSD disks available on the node,
> first all the rotational disks start using SSDs as their journal. If SSDs
> exhaust serving as journal disk, rotational disk try using journal disk
> among themselves and smaller ones are selected as journal

Same question as for point 1: why, for example, on node1 is the journal for /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)?

> 7. If a condition arises that all rotational disks exhaust and still left
> with SSDs, SSDs try to use journals among themselves and within them smaller
> ones would be used as journal

> 8. For achieving this we always sort the rotational and SSDs disks on their
> descending order of size and them apply the above mentioned logic to reach
> to mapping of data disk to a journal disk

All the disks from vdd to vdj have the same size (1T); in that case I think they should be used in alphabetical order, because the current "random/unsorted" state is very confusing for the administrator.

Comment 10 Shubhendu Tripathi 2016-06-02 08:48:14 UTC
(In reply to Daniel Horák from comment #9)
> Shubhendu,
> 
> I think I quite understand the process from your comments (4 and 8). In
> previous comment I tried to point to things which seems not completely in
> keeping with the description.
> 
> (In reply to Shubhendu Tripathi from comment #8)
> > So the logic goes like as below
> > 
> > 1. Bigger sized disks are always given preferences to be used as data disk
> > and smaller sized ones as journal disks
> 
> Why for example on node1 is the journal for /dev/vdd1 on /dev/vdh (size 1TB)
> and not on /dev/vdc (size 100GB)? (Sorry I made a typo in the question in
> previous comment and write vdb instead of vdc.)

It depends upon the type of disk. If it is an SSD, it is always used as a journal disk, whatever its size. Rotational disks always try to use SSDs (if any) as journals first. If there are no SSDs at all, rotational disks use the smaller ones as journals, following the descending sorted order of disks.

> 
> > 2. Mapping of data disks to their journal disk happens per node basis. So a
> > disk cannot have its journal disk on another node
> 
> Sure, this I perfectly understand.
> 
> > 3. In case all the disks are rotational in nature, only one data disk can
> > use another disk as journal disk
> 
> This is also clear for me.
> 
> > 4. If SSDs are available, SSD are given preference to be used as journal
> > disk (even if they are bigger in size, as SSDs can be used as journal disk
> > for multiple data disks, to be specific upto 6 at the moment)
> 
> Why the SSD disks (/dev/vdb) work as journal sometimes for 5 OSD disks
> (node1,2,3) and sometime for 6 disks (node4)?

Currently an SSD can be used as the journal for a maximum of 6 disks. There is a patch to make this default 4: if space is available and rotational disks want to use an SSD as their journal, at most 4 will be able to use it, and even if space is left after that, it won't be utilized.

So effectively, how many journals are placed on an SSD depends on the available size of the SSD and on how many rotational disks are available for data.

> 
> > 6. If there is a mix of rotational and SSD disks available on the node,
> > first all the rotational disks start using SSDs as their journal. If SSDs
> > exhaust serving as journal disk, rotational disk try using journal disk
> > among themselves and smaller ones are selected as journal
> 
> Same question as for note 1.: Why for example on node1 is the journal for
> /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)?
> 
> > 7. If a condition arises that all rotational disks exhaust and still left
> > with SSDs, SSDs try to use journals among themselves and within them smaller
> > ones would be used as journal
> 
> > 8. For achieving this we always sort the rotational and SSDs disks on their
> > descending order of size and them apply the above mentioned logic to reach
> > to mapping of data disk to a journal disk
> 
> All the disks from vdd to vdj have the same size (1T), in that case I think
> it should be used in alphabetical order, because current "random/unsorted"
> state is very confusing for the administrator.

If all the disks have the same size, we do not use alphabetical order to do the mapping. All we do is use one as data and another as journal; whatever order comes out of the sorted list, we just follow it. Using the alphabetical names of the disks is not, I think, a good idea. As long as bigger disks are used as data and smaller ones as journal, it serves the purpose well.

Comment 11 Daniel Horák 2016-06-02 13:35:47 UTC
(In reply to Shubhendu Tripathi from comment #10)
> (In reply to Daniel Horák from comment #9)
> > Shubhendu,
> > 
> > I think I quite understand the process from your comments (4 and 8). In
> > previous comment I tried to point to things which seems not completely in
> > keeping with the description.
> > 
> > (In reply to Shubhendu Tripathi from comment #8)
> > > So the logic goes like as below
> > > 
> > > 1. Bigger sized disks are always given preferences to be used as data disk
> > > and smaller sized ones as journal disks
> > 
> > Why for example on node1 is the journal for /dev/vdd1 on /dev/vdh (size 1TB)
> > and not on /dev/vdc (size 100GB)? (Sorry I made a typo in the question in
> > previous comment and write vdb instead of vdc.)
> 
> It depends upon what is type of disk. If its SSD it would always be used as
> journal disk, whatever size it is of. rotational disks would always try to
> use SSDs (if any) as journal first. If no SSDs at all, rotational disks use
> smaller ones as journal from descending sorted order of disks.

Only /dev/vdb is an "SSD" disk (to be precise, it behaves as an SSD disk), so neither vdh nor vdc is an SSD.

> > 
> > > 2. Mapping of data disks to their journal disk happens per node basis. So a
> > > disk cannot have its journal disk on another node
> > 
> > Sure, this I perfectly understand.
> > 
> > > 3. In case all the disks are rotational in nature, only one data disk can
> > > use another disk as journal disk
> > 
> > This is also clear for me.
> > 
> > > 4. If SSDs are available, SSD are given preference to be used as journal
> > > disk (even if they are bigger in size, as SSDs can be used as journal disk
> > > for multiple data disks, to be specific upto 6 at the moment)
> > 
> > Why the SSD disks (/dev/vdb) work as journal sometimes for 5 OSD disks
> > (node1,2,3) and sometime for 6 disks (node4)?
> 
> Currently SSD can be used as journal for a maximum of 6 disks. There is a
> patch to make this default to 4. If space is available and rotational disks
> want to use an SSD as journal, a maximum of 4 would be able to use. Even if
> space is left after that, it wont be utilized.
> 
> So effectively it depends on the available size of the SSD and how many
> rotational disks are available for data, to figure out how may journals
> would be paced on an SSD.

As you can see in the lsblk output in comment 7, the vdb device on all nodes has 100GB, so there is enough space for 6*5GB.
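(That is 6 journal partitions * 5 GB = 30 GB, which would still leave roughly 70 GB of the 100 GB SSD unused.)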

> > 
> > > 6. If there is a mix of rotational and SSD disks available on the node,
> > > first all the rotational disks start using SSDs as their journal. If SSDs
> > > exhaust serving as journal disk, rotational disk try using journal disk
> > > among themselves and smaller ones are selected as journal
> > 
> > Same question as for note 1.: Why for example on node1 is the journal for
> > /dev/vdd1 on /dev/vdh (size 1TB) and not on /dev/vdc (size 100GB)?
> > 
> > > 7. If a condition arises that all rotational disks exhaust and still left
> > > with SSDs, SSDs try to use journals among themselves and within them smaller
> > > ones would be used as journal
> > 
> > > 8. For achieving this we always sort the rotational and SSDs disks on their
> > > descending order of size and them apply the above mentioned logic to reach
> > > to mapping of data disk to a journal disk
> > 
> > All the disks from vdd to vdj have the same size (1T), in that case I think
> > it should be used in alphabetical order, because current "random/unsorted"
> > state is very confusing for the administrator.
> 
> If all the disks have same size, there is no way we use alphabatical order
> to do the mapping. All we do is use one as data and other as journal.
> Whatever order comes in sorted list we just follow through that. Using
> alphabatical names of the disks, I dont think is a good idea. As long as
> bigger disks are used as data and smaller ones as journal, it serves the
> purpose well.

I understand the sorting according to size. From my point of view it would be helpful, and would look better, to sort disks of (nearly) the same size alphabetically. But it is just a cosmetic (nice-to-have) issue.

Comment 12 Shubhendu Tripathi 2016-06-02 17:33:41 UTC
There is a bug around SSDs being used as journals. Sent patch https://review.gerrithub.io/#/c/278720/ to resolve it.

Comment 13 Shubhendu Tripathi 2016-06-21 05:37:46 UTC
With patches

https://review.gerrithub.io/#/c/278720/ 
https://review.gerrithub.io/#/c/280447/

the journal mapping logic works as expected.

Moving to MODIFIED state.

Comment 14 Daniel Horák 2016-08-02 07:55:11 UTC
Tested on:
  USM Server (RHEL 7.2):
  ceph-ansible-1.0.5-31.el7scon.noarch
  ceph-installer-1.0.14-1.el7scon.noarch
  rhscon-ceph-0.0.38-1.el7scon.x86_64
  rhscon-core-0.0.38-1.el7scon.x86_64
  rhscon-core-selinux-0.0.38-1.el7scon.noarch
  rhscon-ui-0.0.51-1.el7scon.noarch

  Ceph MON (RHEL 7.2):
  calamari-server-1.4.7-1.el7cp.x86_64
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-mon-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.38-1.el7scon.noarch

  Ceph OSD (RHEL 7.2):
  ceph-base-10.2.2-32.el7cp.x86_64
  ceph-common-10.2.2-32.el7cp.x86_64
  ceph-osd-10.2.2-32.el7cp.x86_64
  ceph-selinux-10.2.2-32.el7cp.x86_64
  libcephfs1-10.2.2-32.el7cp.x86_64
  python-cephfs-10.2.2-32.el7cp.x86_64
  rhscon-agent-0.0.16-1.el7scon.noarch
  rhscon-core-selinux-0.0.38-1.el7scon.noarch

The algorithm works as described in this bug.

I've created a new RFE bug for distributing OSDs and journals more predictably across the available disks:
  Bug 1362431 - [RFE] distribute OSDs and journal more predictably across the available disks

>> VERIFIED

