Bug 1517128
| Summary: | [RFE] CRUSH map ruleset for primary replicas on SSD OSDs and secondary replicas on HDD OSDs, on hosts with a mix of SSD and HDD OSDs, may place 2 replicas on the same host | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tomas Petr <tpetr> |
| Component: | RADOS | Assignee: | Josh Durgin <jdurgin> |
| Status: | CLOSED WONTFIX | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.0 | CC: | bhubbard, ceph-eng-bugs, dn-infra-peta-pers, dzafman, gfarnum, hklein, jdurgin, kchai, vumrao |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | 3.* | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-03-07 23:21:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Tomas Petr
2017-11-24 09:45:01 UTC
Steps to reproduce:
Crush map decompiled:
# cat crush.decom.bz
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host osds-0 {
id -2 # do not change unnecessarily
id -7 class hdd # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 0.113
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.025
item osd.8 weight 0.029
item osd.14 weight 0.029
item osd.15 weight 0.029
}
host osds-3 {
id -3 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.025
item osd.6 weight 0.029
item osd.12 weight 0.029
}
host osds-2 {
id -4 # do not change unnecessarily
id -9 class hdd # do not change unnecessarily
id -15 class ssd # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.025
item osd.7 weight 0.029
item osd.11 weight 0.029
}
host osds-1 {
id -5 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
id -16 class ssd # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.025
item osd.5 weight 0.029
item osd.10 weight 0.029
}
host osds-4 {
id -6 # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
id -17 class ssd # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.3 weight 0.025
item osd.9 weight 0.029
item osd.13 weight 0.029
}
root default {
id -1 # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
id -18 class ssd # do not change unnecessarily
# weight 0.449
alg straw
hash 0 # rjenkins1
item osds-0 weight 0.113
item osds-3 weight 0.084
item osds-2 weight 0.084
item osds-1 weight 0.084
item osds-4 weight 0.084
}
# rules
rule ssd-pool {
id 0
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}
rule hdd-pool {
id 1
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}
rule ssd-primary {
id 2
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 1 type host
step emit
step take default class hdd
step chooseleaf firstn -1 type host
step emit
}
# end crush map
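The ssd-primary rule above runs two independent passes (one per `step take` / `step emit` pair), and nothing in the second pass excludes the host already chosen in the first. A minimal Python sketch of that selection logic (a hypothetical simulation of the two-pass structure, not CRUSH itself) shows why host collisions are expected:

```python
import random

# Hosts in this test cluster; each has one SSD OSD plus HDD OSDs.
HOSTS = ["osds-0", "osds-1", "osds-2", "osds-3", "osds-4"]

def ssd_primary_place(rng):
    """Mimic the ssd-primary rule: two *independent* CRUSH passes.

    Pass 1: 'step take default class ssd' + 'chooseleaf firstn 1 type host'
            -> one host supplies the SSD primary.
    Pass 2: 'step take default class hdd' + 'chooseleaf firstn -1 type host'
            -> two hosts supply the HDD replicas, chosen with no knowledge
               of which host pass 1 picked.
    """
    primary_host = rng.choice(HOSTS)
    replica_hosts = rng.sample(HOSTS, 2)  # independent of primary_host
    return [primary_host] + replica_hosts

def has_host_collision(placement):
    return len(set(placement)) < len(placement)

rng = random.Random(42)
collisions = sum(has_host_collision(ssd_primary_place(rng)) for _ in range(10000))
print(f"{collisions / 10000:.0%} of placements reuse a host")  # roughly 40% with 5 hosts
```

With 5 hosts the chance that the primary's host also appears among the 2 HDD hosts is 2/5, which matches the collision rate seen in the pg maps below.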
# ceph osd tree (test system, the ssd OSDs are not actual SSDs)
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.44899 root default
-2 0.11299 host osds-0
8 hdd 0.02899 osd.8 up 1.00000 1.00000
14 hdd 0.02899 osd.14 up 1.00000 1.00000
15 hdd 0.02899 osd.15 up 1.00000 1.00000
0 ssd 0.02499 osd.0 up 1.00000 1.00000
-5 0.08400 host osds-1
5 hdd 0.02899 osd.5 up 1.00000 1.00000
10 hdd 0.02899 osd.10 up 1.00000 1.00000
4 ssd 0.02499 osd.4 up 1.00000 1.00000
-4 0.08400 host osds-2
7 hdd 0.02899 osd.7 up 1.00000 1.00000
11 hdd 0.02899 osd.11 up 1.00000 1.00000
2 ssd 0.02499 osd.2 up 1.00000 1.00000
-3 0.08400 host osds-3
6 hdd 0.02899 osd.6 up 1.00000 1.00000
12 hdd 0.02899 osd.12 up 1.00000 1.00000
1 ssd 0.02499 osd.1 up 1.00000 1.00000
-6 0.08400 host osds-4
9 hdd 0.02899 osd.9 up 1.00000 1.00000
13 hdd 0.02899 osd.13 up 1.00000 1.00000
3 ssd 0.02499 osd.3 up 1.00000 1.00000
- pool creation
[root@mons-0 ~]# ceph osd pool create fast-ssd 8 8 replicated ssd-pool
pool 'fast-ssd' created
[root@mons-0 ~]# ceph osd pool create slow-hdd 8 8 replicated hdd-pool
pool 'slow-hdd' created
[root@mons-0 ~]# ceph osd pool create mixed 8 8 replicated ssd-primary
pool 3 'fast-ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 122 flags hashpspool stripe_width 0
pool 4 'slow-hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125 flags hashpspool stripe_width 0
pool 5 'mixed' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 129 flags hashpspool stripe_width 0
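As an aside, a rule's placements can be checked offline with crushtool before any pool is created. A sketch, assuming the decompiled map was saved and recompiled (the filenames here are hypothetical):

```shell
# Compile the edited text map and replay rule 2 (ssd-primary) offline.
crushtool -c crush.decom.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-mappings
# --show-bad-mappings instead lists only inputs the rule cannot fill.
```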
# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e142 pg 3.0 (3.0) -> up [3,1,0] acting [3,1,0]
osdmap e142 pg 3.1 (3.1) -> up [2,3,1] acting [2,3,1]
osdmap e142 pg 3.2 (3.2) -> up [4,3,1] acting [4,3,1]
osdmap e142 pg 3.3 (3.3) -> up [4,0,2] acting [4,0,2]
osdmap e142 pg 3.4 (3.4) -> up [1,4,3] acting [1,4,3]
osdmap e142 pg 3.5 (3.5) -> up [0,4,2] acting [0,4,2]
osdmap e142 pg 3.6 (3.6) -> up [1,3,2] acting [1,3,2]
osdmap e142 pg 3.7 (3.7) -> up [2,4,1] acting [2,4,1]
^^^ all SSDs - check OK
osdmap e142 pg 4.0 (4.0) -> up [9,10,11] acting [9,10,11]
osdmap e142 pg 4.1 (4.1) -> up [6,13,14] acting [6,13,14]
osdmap e142 pg 4.2 (4.2) -> up [5,14,12] acting [5,14,12]
osdmap e142 pg 4.3 (4.3) -> up [12,15,9] acting [12,15,9]
osdmap e142 pg 4.4 (4.4) -> up [11,15,12] acting [11,15,12]
osdmap e142 pg 4.5 (4.5) -> up [9,6,11] acting [9,6,11]
osdmap e142 pg 4.6 (4.6) -> up [9,11,5] acting [9,11,5]
osdmap e142 pg 4.7 (4.7) -> up [10,9,15] acting [10,9,15]
^^^ all HDDs - check OK
osdmap e142 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e142 pg 5.1 (5.1) -> up [2,14,7] acting [2,14,7]
^^^ osd.2 and 7 are on the same host osds-2
osdmap e142 pg 5.2 (5.2) -> up [3,9,8] acting [3,9,8]
^^^ 3 and 9 on the same host osds-4
osdmap e142 pg 5.3 (5.3) -> up [4,15,6] acting [4,15,6]
osdmap e142 pg 5.4 (5.4) -> up [0,12,5] acting [0,12,5]
osdmap e142 pg 5.5 (5.5) -> up [2,15,9] acting [2,15,9]
osdmap e142 pg 5.6 (5.6) -> up [3,6,5] acting [3,6,5]
osdmap e142 pg 5.7 (5.7) -> up [0,5,13] acting [0,5,13]
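The collisions flagged above can be confirmed mechanically by joining each PG's acting set with the osd-to-host mapping. A small sketch, with both mappings hard-coded from the `ceph osd tree` and `ceph pg map` output above:

```python
# Map each OSD to its host, per the `ceph osd tree` output above.
OSD_HOST = {
    0: "osds-0", 8: "osds-0", 14: "osds-0", 15: "osds-0",
    4: "osds-1", 5: "osds-1", 10: "osds-1",
    2: "osds-2", 7: "osds-2", 11: "osds-2",
    1: "osds-3", 6: "osds-3", 12: "osds-3",
    3: "osds-4", 9: "osds-4", 13: "osds-4",
}

# Acting sets for the 'mixed' pool (pool 5), from the pg map listing above.
PG_ACTING = {
    "5.0": [3, 13, 14], "5.1": [2, 14, 7], "5.2": [3, 9, 8],
    "5.3": [4, 15, 6], "5.4": [0, 12, 5], "5.5": [2, 15, 9],
    "5.6": [3, 6, 5], "5.7": [0, 5, 13],
}

def host_collisions(pg_acting, osd_host):
    """Return pgid -> hosts that appear more than once in its acting set."""
    bad = {}
    for pgid, osds in pg_acting.items():
        hosts = [osd_host[o] for o in osds]
        dups = sorted({h for h in hosts if hosts.count(h) > 1})
        if dups:
            bad[pgid] = dups
    return bad

for pgid, hosts in sorted(host_collisions(PG_ACTING, OSD_HOST).items()):
    print(f"pg {pgid}: two replicas on {', '.join(hosts)}")
```

Run against the data above this reports exactly pgs 5.0, 5.1, and 5.2, matching the annotations in the listing.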
Following the same steps as in the upstream documentation, http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#placing-different-pools-on-different-osds :
# cat decomp
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host osds-0-hdd {
id -2 # do not change unnecessarily
# weight 0.088
alg straw
hash 0 # rjenkins1
item osd.8 weight 0.029
item osd.14 weight 0.029
item osd.15 weight 0.029
}
host osds-0-ssd {
id -7 # do not change unnecessarily
# weight 0.025
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.025
}
host osds-3-hdd {
id -3 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.6 weight 0.029
item osd.12 weight 0.029
}
host osds-3-ssd {
id -8 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.025
}
host osds-2-hdd {
id -4 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.7 weight 0.029
item osd.11 weight 0.029
}
host osds-2-ssd {
id -9 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.025
}
host osds-1-hdd {
id -5 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.5 weight 0.029
item osd.10 weight 0.029
}
host osds-1-ssd {
id -10 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.025
}
host osds-4-hdd {
id -6 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.9 weight 0.029
item osd.13 weight 0.029
}
host osds-4-ssd {
id -11 # do not change unnecessarily
# weight 0.084
alg straw
hash 0 # rjenkins1
item osd.3 weight 0.025
}
# root buckets
root ssd-root {
id -12 # do not change unnecessarily
# weight 0.449
alg straw
hash 0 # rjenkins1
item osds-0-ssd weight 0.025
item osds-3-ssd weight 0.029
item osds-2-ssd weight 0.029
item osds-1-ssd weight 0.029
item osds-4-ssd weight 0.029
}
root hdd-root {
id -1 # do not change unnecessarily
# weight 0.449
alg straw
hash 0 # rjenkins1
item osds-0-hdd weight 0.088
item osds-3-hdd weight 0.058
item osds-2-hdd weight 0.058
item osds-1-hdd weight 0.058
item osds-4-hdd weight 0.058
}
# rules
rule ssd-pool {
id 0
type replicated
min_size 1
max_size 10
step take ssd-root
step chooseleaf firstn 0 type host
step emit
}
rule hdd-pool {
id 1
type replicated
min_size 1
max_size 10
step take hdd-root
step chooseleaf firstn 0 type host
step emit
}
rule ssd-primary {
id 2
type replicated
min_size 1
max_size 10
step take ssd-root
step chooseleaf firstn 1 type host
step emit
step take hdd-root
step chooseleaf firstn -1 type host
step emit
}
# end crush map
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-12 0.14096 root ssd-root
-7 0.02499 host osds-0-ssd
0 ssd 0.02499 osd.0 up 1.00000 1.00000
-10 0.02899 host osds-1-ssd
4 ssd 0.02499 osd.4 up 1.00000 1.00000
-9 0.02899 host osds-2-ssd
2 ssd 0.02499 osd.2 up 1.00000 1.00000
-8 0.02899 host osds-3-ssd
1 ssd 0.02499 osd.1 up 1.00000 1.00000
-11 0.02899 host osds-4-ssd
3 ssd 0.02499 osd.3 up 1.00000 1.00000
-1 0.31999 root hdd-root
-2 0.08800 host osds-0-hdd
8 hdd 0.02899 osd.8 up 1.00000 1.00000
14 hdd 0.02899 osd.14 up 1.00000 1.00000
15 hdd 0.02899 osd.15 up 1.00000 1.00000
-5 0.05800 host osds-1-hdd
5 hdd 0.02899 osd.5 up 1.00000 1.00000
10 hdd 0.02899 osd.10 up 1.00000 1.00000
-4 0.05800 host osds-2-hdd
7 hdd 0.02899 osd.7 up 1.00000 1.00000
11 hdd 0.02899 osd.11 up 1.00000 1.00000
-3 0.05800 host osds-3-hdd
6 hdd 0.02899 osd.6 up 1.00000 1.00000
12 hdd 0.02899 osd.12 up 1.00000 1.00000
-6 0.05800 host osds-4-hdd
9 hdd 0.02899 osd.9 up 1.00000 1.00000
13 hdd 0.02899 osd.13 up 1.00000 1.00000
# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e144 pg 3.0 (3.0) -> up [4,3,2] acting [4,3,2]
osdmap e144 pg 3.1 (3.1) -> up [4,2,3] acting [4,2,3]
osdmap e144 pg 3.2 (3.2) -> up [2,0,3] acting [2,0,3]
osdmap e144 pg 3.3 (3.3) -> up [2,1,4] acting [2,1,4]
osdmap e144 pg 3.4 (3.4) -> up [1,0,4] acting [1,0,4]
osdmap e144 pg 3.5 (3.5) -> up [4,0,1] acting [4,0,1]
osdmap e144 pg 3.6 (3.6) -> up [4,1,3] acting [4,1,3]
osdmap e144 pg 3.7 (3.7) -> up [3,1,2] acting [3,1,2]
^^^ all SSDs - check OK
osdmap e144 pg 4.0 (4.0) -> up [11,13,15] acting [11,13,15]
osdmap e144 pg 4.1 (4.1) -> up [11,5,13] acting [11,5,13]
osdmap e144 pg 4.2 (4.2) -> up [9,7,6] acting [9,7,6]
osdmap e144 pg 4.3 (4.3) -> up [15,9,12] acting [15,9,12]
osdmap e144 pg 4.4 (4.4) -> up [6,11,10] acting [6,11,10]
osdmap e144 pg 4.5 (4.5) -> up [9,10,15] acting [9,10,15]
osdmap e144 pg 4.6 (4.6) -> up [14,6,11] acting [14,6,11]
osdmap e144 pg 4.7 (4.7) -> up [14,9,6] acting [14,9,6]
^^^ all HDDs - check OK
osdmap e144 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e144 pg 5.1 (5.1) -> up [2,5,6] acting [2,5,6]
osdmap e144 pg 5.2 (5.2) -> up [3,10,13] acting [3,10,13]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e144 pg 5.3 (5.3) -> up [2,6,11] acting [2,6,11]
^^^ osd 2 and 11 are on the same host osds-2
osdmap e144 pg 5.4 (5.4) -> up [1,13,12] acting [1,13,12]
^^^ osd 1 and 12 are on the same physical host osds-3
osdmap e144 pg 5.5 (5.5) -> up [0,9,7] acting [0,9,7]
osdmap e144 pg 5.6 (5.6) -> up [1,6,5] acting [1,6,5]
^^^ osd 1 and 6 are on the same physical host osds-3
osdmap e144 pg 5.7 (5.7) -> up [4,5,13] acting [4,5,13]

There are various workarounds that can let users approach this problem, but a generic way of handling this is unfortunately beyond the plausible scope of CRUSH. I suspect the best approach would be to add a replica-affinity parameter that mirrors primary-affinity, so we could set SSDs to be primaries and not replicas, and vice versa on hard drives. If it becomes a strategic imperative we can open an RFE for that.
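For reference, the primary-affinity half of that suggestion already exists. A hedged sketch of using it as a partial workaround on a single pool spanning both device classes (it steers reads/primaries to the SSDs but does not fix the placement overlap; osd IDs match this test cluster, and older releases may require the mon_osd_allow_primary_affinity setting):

```shell
# Make the HDD OSDs ineligible to be primary, so the SSD OSDs
# serve as primaries wherever a PG includes one.
for osd in 5 6 7 8 9 10 11 12 13 14 15; do
    ceph osd primary-affinity osd.$osd 0
done
```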