Created attachment 1202447 [details] crush map Description of problem: There are several crush rule sets. Where all of them have two disks available. Disks are exclusively available only for one of any of those rule sets. Hence default rule set has only two disks. However they are full even when the pool using this rule set has zero objects. Version-Release number of selected component (if applicable): libcephfs1-10.2.2-41.el7cp.x86_64 ceph-common-10.2.2-41.el7cp.x86_64 ceph-selinux-10.2.2-41.el7cp.x86_64 python-cephfs-10.2.2-41.el7cp.x86_64 ceph-base-10.2.2-41.el7cp.x86_64 How reproducible: 80% Steps to Reproduce: 1. Create cluster using Red Hat Storage Console 2.0. During creation create several new storage profiles. Each of them should have two disks. Leave two disks in default profile. 2. Create one pool for each storage profile with replication. Some of newly created pools have replication number set to 4. All others have replication number set to 2. 3. Start filling pools except the pool using the default profile. 4. Investigate ceph osd status. E.g. (ceph poolDef is pool with default storage profile) # ceph osd df ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 3 0.04900 1.00000 51173M 41606M 9567M 81.30 1.23 128 10 0.04900 1.00000 51173M 41605M 9568M 81.30 1.23 128 4 0.04900 1.00000 51173M 45097M 6076M 88.12 1.33 128 6 0.04900 1.00000 51173M 45096M 6077M 88.12 1.33 128 1 0.04900 1.00000 51173M 41000M 10173M 80.12 1.21 128 11 0.04900 1.00000 51173M 41000M 10173M 80.12 1.21 128 5 0.04900 1.00000 51173M 7185M 43988M 14.04 0.21 128 8 0.04900 1.00000 51173M 7186M 43987M 14.04 0.21 128 2 0.04900 1.00000 51173M 19516M 31657M 38.14 0.58 128 9 0.04900 1.00000 51173M 19516M 31657M 38.14 0.58 128 0 0.04900 1.00000 51173M 48754M 2419M 95.27 1.44 384 7 0.04900 1.00000 51173M 48756M 2417M 95.28 1.44 384 TOTAL 599G 396G 202G 66.17 MIN/MAX VAR: 0.21/1.44 STDDEV: 29.60 # ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 599G 202G 396G 66.17 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS pool1 1 19476M 38.09 31657M 20 pool2 2 7144M 24.52 21993M 7 pool3 3 40960M 80.10 10173M 40 pool4 4 45056M 88.12 6076M 44 pool5 5 41564M 89.68 4783M 41 poolDef 6 0 0 2417M 0 # ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -18 0.09799 root profil5 -16 0.04900 host dhcp-126-125-profil5 3 0.04900 osd.3 up 1.00000 1.00000 -17 0.04900 host dhcp-126-126-profil5 10 0.04900 osd.10 up 1.00000 1.00000 -15 0.09799 root profil4 -13 0.04900 host dhcp-126-125-profil4 4 0.04900 osd.4 up 1.00000 1.00000 -14 0.04900 host dhcp-126-126-profil4 6 0.04900 osd.6 up 1.00000 1.00000 -12 0.09799 root profil3 -10 0.04900 host dhcp-126-125-profil3 1 0.04900 osd.1 up 1.00000 1.00000 -11 0.04900 host dhcp-126-126-profil3 11 0.04900 osd.11 up 1.00000 1.00000 -9 0.09799 root profil2 -7 0.04900 host dhcp-126-125-profil2 5 0.04900 osd.5 up 1.00000 1.00000 -8 0.04900 host dhcp-126-126-profil2 8 0.04900 osd.8 up 1.00000 1.00000 -6 0.09799 root profil1 -4 0.04900 host dhcp-126-125-profil1 2 0.04900 osd.2 up 1.00000 1.00000 -5 0.04900 host dhcp-126-126-profil1 9 0.04900 osd.9 up 1.00000 1.00000 -1 0.09799 root default -2 0.04900 host dhcp-126-125 0 0.04900 osd.0 up 1.00000 1.00000 -3 0.04900 host dhcp-126-126 7 0.04900 osd.7 up 1.00000 1.00000 ### all other pools have different ruleset ### # ceph osd pool get poolDef crush_ruleset crush_ruleset: 0 # ceph osd crush rule dump replicated_ruleset { "rule_id": 0, "rule_name": "replicated_ruleset", "ruleset": 0, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -1, "item_name": "default" }, { "op": "chooseleaf_firstn", "num": 0, "type": "host" }, { "op": "emit" } ] } Actual results: Even those two disks that are part of the default storage profile are filled. Expected results: No other than proper disks are filled. Additional info:
This is correct (if weird) behavior. The rulesets are created correctly, but size is set to 4, which is more than the rulesets can actually map. It seems that the pools were originally created with the default ruleset, and then changed to use the custom rulesets. For some of the pgs in the size=4 pools, the primary kept the old 2 osds mapped as well from the default pool since it couldn't delete them without going clean first (which it can't do, since it doesn't have 4 osds...). This is odd behavior, but it's more understandable if you imagine that the pool had a bunch of data in it already. Users would be unhappy if we deleted the old copies before we it replicated over to 4 new osds.