Description of problem:

Creating a CRUSH map ruleset that places the primary replica on SSD OSDs and the secondary replicas on HDD OSDs will, when SSD and HDD OSDs are located on the same host, place 2 replicas of the same PG on that host: the primary on the SSD OSD and a secondary on an HDD OSD of the same host.

Version-Release number of selected component (if applicable):
# ceph version
ceph version 12.2.1-39.el7cp (22e26be5a4920c95c43f647b31349484f663e4b9) luminous (stable)

How reproducible:
Always

Steps to Reproduce:
1. Create a ceph environment where SSD OSDs and HDD OSDs are on the same hosts.
2. Follow the steps to create a CRUSH map ruleset with primary replicas on SSD OSDs and secondary replicas on HDD OSDs (see the sketch under Additional info).
3. Check that 2 replicas of the same PG may land on the same host.

Actual results:
Two replicas of the same PG can land on the same host (see the PG mappings in the reproduction below).

Expected results:
No two replicas of the same PG are placed on the same host.

Additional info:
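The CRUSH map edits in the reproduction below follow the usual decompile/edit/recompile cycle. A minimal sketch of that workflow (the file names are placeholders, not taken from this report):

# ceph osd getcrushmap -o /tmp/crush.bin          # dump the compiled map
# crushtool -d /tmp/crush.bin -o /tmp/crush.txt   # decompile to editable text
  ... edit /tmp/crush.txt (e.g. add the ssd-primary rule shown below) ...
# crushtool -c /tmp/crush.txt -o /tmp/crush.new   # recompile
# ceph osd setcrushmap -i /tmp/crush.new          # inject the new map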
reproduced steps:

Crush map decompiled:

# cat crush.decom.bz
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osds-0 {
    id -2               # do not change unnecessarily
    id -7 class hdd     # do not change unnecessarily
    id -13 class ssd    # do not change unnecessarily
    # weight 0.113
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 0.025
    item osd.8 weight 0.029
    item osd.14 weight 0.029
    item osd.15 weight 0.029
}
host osds-3 {
    id -3               # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    id -14 class ssd    # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.1 weight 0.025
    item osd.6 weight 0.029
    item osd.12 weight 0.029
}
host osds-2 {
    id -4               # do not change unnecessarily
    id -9 class hdd     # do not change unnecessarily
    id -15 class ssd    # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 0.025
    item osd.7 weight 0.029
    item osd.11 weight 0.029
}
host osds-1 {
    id -5               # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    id -16 class ssd    # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.4 weight 0.025
    item osd.5 weight 0.029
    item osd.10 weight 0.029
}
host osds-4 {
    id -6               # do not change unnecessarily
    id -11 class hdd    # do not change unnecessarily
    id -17 class ssd    # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.3 weight 0.025
    item osd.9 weight 0.029
    item osd.13 weight 0.029
}
root default {
    id -1               # do not change unnecessarily
    id -12 class hdd    # do not change unnecessarily
    id -18 class ssd    # do not change unnecessarily
    # weight 0.449
    alg straw
    hash 0  # rjenkins1
    item osds-0 weight 0.113
    item osds-3 weight 0.084
    item osds-2 weight 0.084
    item osds-1 weight 0.084
    item osds-4 weight 0.084
}

# rules
rule ssd-pool {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule hdd-pool {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-primary {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 1 type host
    step emit
    step take default class hdd
    step chooseleaf firstn -1 type host
    step emit
}

# end crush map

# ceph osd tree (test system, the ssd OSDs are not actual SSDs)
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       0.44899 root default
-2       0.11299     host osds-0
 8   hdd 0.02899         osd.8        up  1.00000 1.00000
14   hdd 0.02899         osd.14       up  1.00000 1.00000
15   hdd 0.02899         osd.15       up  1.00000 1.00000
 0   ssd 0.02499         osd.0        up  1.00000 1.00000
-5       0.08400     host osds-1
 5   hdd 0.02899         osd.5        up  1.00000 1.00000
10   hdd 0.02899         osd.10       up  1.00000 1.00000
 4   ssd 0.02499         osd.4        up  1.00000 1.00000
-4       0.08400     host osds-2
 7   hdd 0.02899         osd.7        up  1.00000 1.00000
11   hdd 0.02899         osd.11       up  1.00000 1.00000
 2   ssd 0.02499         osd.2        up  1.00000 1.00000
-3       0.08400     host osds-3
 6   hdd 0.02899         osd.6        up  1.00000 1.00000
12   hdd 0.02899         osd.12       up  1.00000 1.00000
 1   ssd 0.02499         osd.1        up  1.00000 1.00000
-6       0.08400     host osds-4
 9   hdd 0.02899         osd.9        up  1.00000 1.00000
13   hdd 0.02899         osd.13       up  1.00000 1.00000
 3   ssd 0.02499         osd.3        up  1.00000 1.00000

- pool creation
[root@mons-0 ~]# ceph osd pool create fast-ssd 8 8 replicated ssd-pool
pool 'fast-ssd' created
[root@mons-0 ~]# ceph osd pool create slow-hdd 8 8 replicated hdd-pool
pool 'slow-hdd' created
[root@mons-0 ~]# ceph osd pool create mixed 8 8 replicated ssd-primary

pool 3 'fast-ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 122 flags hashpspool stripe_width 0
pool 4 'slow-hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125 flags hashpspool stripe_width 0
pool 5 'mixed' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 129 flags hashpspool stripe_width 0

# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e142 pg 3.0 (3.0) -> up [3,1,0] acting [3,1,0]
osdmap e142 pg 3.1 (3.1) -> up [2,3,1] acting [2,3,1]
osdmap e142 pg 3.2 (3.2) -> up [4,3,1] acting [4,3,1]
osdmap e142 pg 3.3 (3.3) -> up [4,0,2] acting [4,0,2]
osdmap e142 pg 3.4 (3.4) -> up [1,4,3] acting [1,4,3]
osdmap e142 pg 3.5 (3.5) -> up [0,4,2] acting [0,4,2]
osdmap e142 pg 3.6 (3.6) -> up [1,3,2] acting [1,3,2]
osdmap e142 pg 3.7 (3.7) -> up [2,4,1] acting [2,4,1]
^^^ all SSDs - check OK

osdmap e142 pg 4.0 (4.0) -> up [9,10,11] acting [9,10,11]
osdmap e142 pg 4.1 (4.1) -> up [6,13,14] acting [6,13,14]
osdmap e142 pg 4.2 (4.2) -> up [5,14,12] acting [5,14,12]
osdmap e142 pg 4.3 (4.3) -> up [12,15,9] acting [12,15,9]
osdmap e142 pg 4.4 (4.4) -> up [11,15,12] acting [11,15,12]
osdmap e142 pg 4.5 (4.5) -> up [9,6,11] acting [9,6,11]
osdmap e142 pg 4.6 (4.6) -> up [9,11,5] acting [9,11,5]
osdmap e142 pg 4.7 (4.7) -> up [10,9,15] acting [10,9,15]
^^^ all HDDs - check OK

osdmap e142 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd.3 and osd.13 are on the same host osds-4
osdmap e142 pg 5.1 (5.1) -> up [2,14,7] acting [2,14,7]
^^^ osd.2 and osd.7 are on the same host osds-2
osdmap e142 pg 5.2 (5.2) -> up [3,9,8] acting [3,9,8]
^^^ osd.3 and osd.9 are on the same host osds-4
osdmap e142 pg 5.3 (5.3) -> up [4,15,6] acting [4,15,6]
osdmap e142 pg 5.4 (5.4) -> up [0,12,5] acting [0,12,5]
osdmap e142 pg 5.5 (5.5) -> up [2,15,9] acting [2,15,9]
osdmap e142 pg 5.6 (5.6) -> up [3,6,5] acting [3,6,5]
osdmap e142 pg 5.7 (5.7) -> up [0,5,13] acting [0,5,13]
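The host overlaps flagged above can be cross-checked against the CRUSH location reported for each OSD, e.g. for pg 5.0:

# ceph osd find 3
# ceph osd find 13

Per the osd tree above, both should report host osds-4 in their crush_location (the exact JSON field names may differ between releases).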
following the same steps that are in the upstream link
http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#placing-different-pools-on-different-osds

# cat decomp
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osds-0-hdd {
    id -2       # do not change unnecessarily
    # weight 0.088
    alg straw
    hash 0  # rjenkins1
    item osd.8 weight 0.029
    item osd.14 weight 0.029
    item osd.15 weight 0.029
}
host osds-0-ssd {
    id -7       # do not change unnecessarily
    # weight 0.025
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 0.025
}
host osds-3-hdd {
    id -3       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.6 weight 0.029
    item osd.12 weight 0.029
}
host osds-3-ssd {
    id -8       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.1 weight 0.025
}
host osds-2-hdd {
    id -4       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.7 weight 0.029
    item osd.11 weight 0.029
}
host osds-2-ssd {
    id -9       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 0.025
}
host osds-1-hdd {
    id -5       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.5 weight 0.029
    item osd.10 weight 0.029
}
host osds-1-ssd {
    id -10      # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.4 weight 0.025
}
host osds-4-hdd {
    id -6       # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.9 weight 0.029
    item osd.13 weight 0.029
}
host osds-4-ssd {
    id -11      # do not change unnecessarily
    # weight 0.084
    alg straw
    hash 0  # rjenkins1
    item osd.3 weight 0.025
}

# root buckets
root ssd-root {
    id -12      # do not change unnecessarily
    # weight 0.449
    alg straw
    hash 0  # rjenkins1
    item osds-0-ssd weight 0.025
    item osds-3-ssd weight 0.029
    item osds-2-ssd weight 0.029
    item osds-1-ssd weight 0.029
    item osds-4-ssd weight 0.029
}
root hdd-root {
    id -1       # do not change unnecessarily
    # weight 0.449
    alg straw
    hash 0  # rjenkins1
    item osds-0-hdd weight 0.088
    item osds-3-hdd weight 0.058
    item osds-2-hdd weight 0.058
    item osds-1-hdd weight 0.058
    item osds-4-hdd weight 0.058
}

# rules
rule ssd-pool {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take ssd-root
    step chooseleaf firstn 0 type host
    step emit
}
rule hdd-pool {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take hdd-root
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-primary {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take ssd-root
    step chooseleaf firstn 1 type host
    step emit
    step take hdd-root
    step chooseleaf firstn -1 type host
    step emit
}

# end crush map

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-12       0.14096 root ssd-root
 -7       0.02499     host osds-0-ssd
  0   ssd 0.02499         osd.0            up  1.00000 1.00000
-10       0.02899     host osds-1-ssd
  4   ssd 0.02499         osd.4            up  1.00000 1.00000
 -9       0.02899     host osds-2-ssd
  2   ssd 0.02499         osd.2            up  1.00000 1.00000
 -8       0.02899     host osds-3-ssd
  1   ssd 0.02499         osd.1            up  1.00000 1.00000
-11       0.02899     host osds-4-ssd
  3   ssd 0.02499         osd.3            up  1.00000 1.00000
 -1       0.31999 root hdd-root
 -2       0.08800     host osds-0-hdd
  8   hdd 0.02899         osd.8            up  1.00000 1.00000
 14   hdd 0.02899         osd.14           up  1.00000 1.00000
 15   hdd 0.02899         osd.15           up  1.00000 1.00000
 -5       0.05800     host osds-1-hdd
  5   hdd 0.02899         osd.5            up  1.00000 1.00000
 10   hdd 0.02899         osd.10           up  1.00000 1.00000
 -4       0.05800     host osds-2-hdd
  7   hdd 0.02899         osd.7            up  1.00000 1.00000
 11   hdd 0.02899         osd.11           up  1.00000 1.00000
 -3       0.05800     host osds-3-hdd
  6   hdd 0.02899         osd.6            up  1.00000 1.00000
 12   hdd 0.02899         osd.12           up  1.00000 1.00000
 -6       0.05800     host osds-4-hdd
  9   hdd 0.02899         osd.9            up  1.00000 1.00000
 13   hdd 0.02899         osd.13           up  1.00000 1.00000

# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e144 pg 3.0 (3.0) -> up [4,3,2] acting [4,3,2]
osdmap e144 pg 3.1 (3.1) -> up [4,2,3] acting [4,2,3]
osdmap e144 pg 3.2 (3.2) -> up [2,0,3] acting [2,0,3]
osdmap e144 pg 3.3 (3.3) -> up [2,1,4] acting [2,1,4]
osdmap e144 pg 3.4 (3.4) -> up [1,0,4] acting [1,0,4]
osdmap e144 pg 3.5 (3.5) -> up [4,0,1] acting [4,0,1]
osdmap e144 pg 3.6 (3.6) -> up [4,1,3] acting [4,1,3]
osdmap e144 pg 3.7 (3.7) -> up [3,1,2] acting [3,1,2]
^^^ all SSDs - check OK

osdmap e144 pg 4.0 (4.0) -> up [11,13,15] acting [11,13,15]
osdmap e144 pg 4.1 (4.1) -> up [11,5,13] acting [11,5,13]
osdmap e144 pg 4.2 (4.2) -> up [9,7,6] acting [9,7,6]
osdmap e144 pg 4.3 (4.3) -> up [15,9,12] acting [15,9,12]
osdmap e144 pg 4.4 (4.4) -> up [6,11,10] acting [6,11,10]
osdmap e144 pg 4.5 (4.5) -> up [9,10,15] acting [9,10,15]
osdmap e144 pg 4.6 (4.6) -> up [14,6,11] acting [14,6,11]
osdmap e144 pg 4.7 (4.7) -> up [14,9,6] acting [14,9,6]
^^^ all HDDs - check OK

osdmap e144 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd.3 (osds-4-ssd) and osd.13 (osds-4-hdd) are on the same physical host osds-4
osdmap e144 pg 5.1 (5.1) -> up [2,5,6] acting [2,5,6]
osdmap e144 pg 5.2 (5.2) -> up [3,10,13] acting [3,10,13]
^^^ osd.3 and osd.13 are on the same physical host osds-4
osdmap e144 pg 5.3 (5.3) -> up [2,6,11] acting [2,6,11]
^^^ osd.2 and osd.11 are on the same physical host osds-2
osdmap e144 pg 5.4 (5.4) -> up [1,13,12] acting [1,13,12]
^^^ osd.1 (osds-3-ssd) and osd.12 (osds-3-hdd) are on the same physical host osds-3
osdmap e144 pg 5.5 (5.5) -> up [0,9,7] acting [0,9,7]
osdmap e144 pg 5.6 (5.6) -> up [1,6,5] acting [1,6,5]
^^^ osd.1 (osds-3-ssd) and osd.6 (osds-3-hdd) are on the same physical host osds-3
osdmap e144 pg 5.7 (5.7) -> up [4,5,13] acting [4,5,13]
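A rough script to flag every PG in the 'mixed' pool (pool 5, 8 PGs as created above) whose up set touches the same physical node twice. It assumes jq is installed, relies on the JSON field names of this luminous build, and strips the -ssd/-hdd suffix so the split host buckets of this second layout fold back to the physical host name:

for j in $(seq 0 7); do
  pg=5.$j
  hosts=$(for osd in $(ceph pg map $pg -f json | jq -r '.up[]'); do
            # map each OSD in the up set to its CRUSH host, folding
            # osds-N-ssd / osds-N-hdd back to the physical node osds-N
            ceph osd find $osd | jq -r '.crush_location.host' \
              | sed -e 's/-ssd$//' -e 's/-hdd$//'
          done)
  dup=$(echo "$hosts" | sort | uniq -d)
  [ -n "$dup" ] && echo "$pg: up set shares physical host(s): $dup"
done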
There are various workarounds that let users approximate this behaviour, but a generic way of handling it is unfortunately beyond the plausible scope of CRUSH. I suspect the best approach would be to add a replica-affinity parameter that mirrors primary-affinity, so we could set SSDs to be primaries but not replicas, and vice versa for hard drives. If it becomes a strategic imperative we can open an RFE for that.
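For context on that proposal: primary affinity already exists, but it only biases which member of an already-chosen up set acts as primary; it cannot force CRUSH to pick an SSD for every PG in the first place. So a single class-agnostic rule plus zero primary affinity on the HDDs gives host separation but not a guaranteed SSD primary, which is why it is only a partial workaround. A rough sketch of how primary affinity is set today, using the OSD ids of this test cluster where osd.5-osd.15 are the HDDs (older releases may additionally need mon_osd_allow_primary_affinity enabled):

# for id in $(seq 5 15); do ceph osd primary-affinity osd.$id 0; done

The proposed replica-affinity would be the mirror image: a knob that discourages an OSD from holding non-primary replicas.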