Description of problem
======================

When I install and configure a stretched ceph cluster as explained in the ODF
MetroDR docs, the reported MAX AVAIL value of all pools is half of the
expected value: with N GiB of raw storage, the cluster reports that it can
hold just N/8 GiB of data instead of the expected N/4 GiB.

When I try to write MAX AVAIL amount of data into the cluster, it turns out
that the problem is only with the reported MAX AVAIL value, and that it's
actually possible to utilize the expected cluster capacity. That said, this
is a problem for monitoring, as it's not possible to rely on the MAX AVAIL
value to see how much data one can store in the cluster.

This could be a problem with the ODF MetroDR stretched setup docs, or a bug
in RHCS ceph related to stretch mode, or a combination of both.

Version-Release number of selected component
============================================

ODF 4.11 MetroDR doc draft
Red Hat Ansible 2.9 for RHEL 8
RHCEPH-5.2-RHEL-8-20220610.ci.1
ceph version 16.2.8-42.el8cp (c15e56a8d2decae9230567653130d1e31a36fe0a) pacific (stable)

How reproducible
================

4/4

Steps to Reproduce
==================

1. Deploy 7 RHEL 8.5 machines, 6 of them with 2 local block devices (for OSDs).
2. Install a ceph cluster on these machines, as explained in sections
   "3.4. Node pre-deployment requirements" and "3.5. Cluster bootstrapping and
   service deployment with Cephadm" of the ODF MetroDR docs.
3. Check health and status of the ceph cluster via `ceph -s` and `ceph df`.
4. Enable stretch mode as explained in chapter "4. Configuring Red Hat Ceph
   Storage stretch cluster" of the ODF MetroDR docs.
5. Check health and status of the ceph cluster via `ceph -s` and `ceph df`.

Actual results
==============

When the ceph cluster is installed and I check the status for the 1st time
(step #3), I see that the cluster is healthy and the reported MAX AVAIL
values are within expectations:

```
[root@osd-0 ~]# ceph -s
  cluster:
    id:     95745a4e-f2f3-11ec-be2a-0050568f082e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum osd-0,osd-1,osd-3,osd-4,arbiter (age 6m)
    mgr: osd-0.szcoki(active, since 27m), standbys: osd-3.lhrhtt
    osd: 12 osds: 12 up (since 5m), 12 in (since 5m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 191 objects, 5.3 KiB
    usage:   76 MiB used, 192 GiB / 192 GiB avail
    pgs:     129 active+clean

[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    192 GiB  192 GiB  76 MiB    76 MiB       0.04
TOTAL  192 GiB  192 GiB  76 MiB    76 MiB       0.04

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0     61 GiB
.rgw.root               2   32  1.3 KiB        4   48 KiB      0     61 GiB
default.rgw.log         3   32  3.6 KiB      177  408 KiB      0     61 GiB
default.rgw.control     4   32      0 B        8      0 B      0     61 GiB
default.rgw.meta        5   32    382 B        2   24 KiB      0     61 GiB
```

This is expected: with 3-way replication and 192 GiB of total raw storage,
MAX AVAIL on an empty cluster should be near 192/3 GiB = 64.0 GiB, which is
the case here, since 61 GiB is close enough.
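As a quick sanity check of that expectation, the same ratio can be pulled straight from the cluster. This is a minimal sketch rather than part of the original reproduction steps; the pool name is just one of the pools above, and `jq` is assumed to be available:

```
# Expected per-pool MAX AVAIL on an empty cluster ~= raw AVAIL / replica count
# (ignoring CRUSH weight imbalance and the OSD full ratios).
ceph df --format json | jq '.stats.total_avail_bytes'   # raw available bytes
ceph osd pool get .rgw.root size                         # replica count, e.g. "size: 3"
```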
But when I do the same check after stretch mode is enabled, I see that the
MAX AVAIL value of the pools is much smaller:

```
[root@osd-0 ~]# ceph -s
  cluster:
    id:     95745a4e-f2f3-11ec-be2a-0050568f082e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum osd-0,osd-1,osd-3,osd-4,arbiter (age 55m)
    mgr: osd-0.szcoki(active, since 27h), standbys: osd-3.lhrhtt
    mds: 1/1 daemons up, 3 standby
    osd: 12 osds: 12 up (since 26h), 12 in (since 26h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 225 pgs
    objects: 262 objects, 7.5 KiB
    usage:   498 MiB used, 191 GiB / 192 GiB avail
    pgs:     225 active+clean

[root@osd-0 ~]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    192 GiB  191 GiB  498 MiB   498 MiB       0.25
TOTAL  192 GiB  191 GiB  498 MiB   498 MiB       0.25

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B       17      0 B      0     23 GiB
.rgw.root               2   32  1.3 KiB        4   64 KiB      0     23 GiB
default.rgw.log         3   32  3.6 KiB      209  544 KiB      0     23 GiB
default.rgw.control     4   32      0 B        8      0 B      0     23 GiB
default.rgw.meta        5   32    382 B        2   32 KiB      0     23 GiB
rbdpool                 6   32      0 B        0      0 B      0     23 GiB
cephfs.cephfs.meta      7   32  2.3 KiB       22  128 KiB      0     23 GiB
cephfs.cephfs.data      8   32      0 B        0      0 B      0     23 GiB
```

This no longer adds up: stretch mode uses 4-way replication, so the MAX AVAIL
value should be close to 192/4 GiB = 48.0 GiB, but instead it's half of that.
This means that for 192 GiB of raw storage (as attached to the storage
machines), the cluster reports that it can hold just 23 GiB of data.

Expected results
================

When stretch mode is enabled, the MAX AVAIL value is close to
192/4 GiB = 48.0 GiB.

Additional info
===============

Cluster structure:

```
# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         0.18713  root default
 -3         0.09357      datacenter DC1
 -2         0.03119          host osd-0
  0    hdd  0.01559              osd.0        up   1.00000  1.00000
  3    hdd  0.01559              osd.3        up   1.00000  1.00000
 -4         0.03119          host osd-1
  1    hdd  0.01559              osd.1        up   1.00000  1.00000
  2    hdd  0.01559              osd.2        up   1.00000  1.00000
 -5         0.03119          host osd-2
  7    hdd  0.01559              osd.7        up   1.00000  1.00000
 11    hdd  0.01559              osd.11       up   1.00000  1.00000
 -7         0.09357      datacenter DC2
 -6         0.03119          host osd-3
  4    hdd  0.01559              osd.4        up   1.00000  1.00000
  8    hdd  0.01559              osd.8        up   1.00000  1.00000
 -8         0.03119          host osd-4
  5    hdd  0.01559              osd.5        up   1.00000  1.00000
  9    hdd  0.01559              osd.9        up   1.00000  1.00000
 -9         0.03119          host osd-5
  6    hdd  0.01559              osd.6        up   1.00000  1.00000
 10    hdd  0.01559              osd.10       up   1.00000  1.00000

# ceph orch host ls
HOST     ADDR          LABELS              STATUS
arbiter  10.1.160.104  mon
osd-0    10.1.161.79   _admin osd mon mgr
osd-1    10.1.160.73   osd mon
osd-2    10.1.160.74   osd mds rgw
osd-3    10.1.160.63   osd mon mgr
osd-4    10.1.161.82   osd mon
osd-5    10.1.160.236  osd mds rgw
```

Stretch mode is enabled:

```
# ceph mon dump
epoch 12
fsid 95745a4e-f2f3-11ec-be2a-0050568f082e
last_changed 2022-06-24T15:10:27.957502+0000
created 2022-06-23T12:54:30.748110+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon arbiter
disallowed_leaders arbiter
0: [v2:10.1.161.79:3300/0,v1:10.1.161.79:6789/0] mon.osd-0; crush_location {datacenter=DC1}
1: [v2:10.1.160.73:3300/0,v1:10.1.160.73:6789/0] mon.osd-1; crush_location {datacenter=DC1}
2: [v2:10.1.160.63:3300/0,v1:10.1.160.63:6789/0] mon.osd-3; crush_location {datacenter=DC2}
3: [v2:10.1.161.82:3300/0,v1:10.1.161.82:6789/0] mon.osd-4; crush_location {datacenter=DC2}
4: [v2:10.1.160.104:3300/0,v1:10.1.160.104:6789/0] mon.arbiter; crush_location {datacenter=DC3}
dumped monmap epoch 12
```

See also attached:

- `cluster-spec.yaml` with the cluster specification for the cephadm
  orchestrator
- script `cluster-install.sh` with the cephadm command to install the cluster
- script `cluster-postinstall.sh` with additional post-install steps (creating
  an RBD pool and a CephFS volume)
- script `cluster-stretchsetup.sh` with the commands to enable the stretch setup

And further debug details:

- output of `ceph osd dump`
- plaintext version of the CRUSH map dump

Additional experiment
=====================

For testing purposes, I'm going to create a 22 GiB rbd image, which is close
to the reported MAX AVAIL, and mount it directly on the admin/bootstrap
storage machine so that I can try to write data there:

```
# ceph osd pool create rbdtest 32 32
pool 'rbdtest' created
# ceph osd pool application enable rbdtest rbd
enabled application 'rbd' on pool 'rbdtest'
# rbd pool init -p rbdtest
# rbd create data --size 22528 --pool rbdtest
# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    192 GiB  191 GiB  496 MiB   496 MiB       0.25
TOTAL  192 GiB  191 GiB  496 MiB   496 MiB       0.25

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B       17      0 B      0     23 GiB
.rgw.root               2   32  1.3 KiB        4   64 KiB      0     23 GiB
default.rgw.log         3   32  3.6 KiB      209  544 KiB      0     23 GiB
default.rgw.control     4   32      0 B        8      0 B      0     23 GiB
default.rgw.meta        5   32    382 B        2   32 KiB      0     23 GiB
rbdpool                 6   32      0 B        0      0 B      0     23 GiB
cephfs.cephfs.meta      7   32  2.3 KiB       22  128 KiB      0     23 GiB
cephfs.cephfs.data      8   32      0 B        0      0 B      0     23 GiB
rbdtest                 9   32  1.4 KiB        5   48 KiB      0     23 GiB

# rbd map rbdtest/data --name client.admin
/dev/rbd0
# mkfs.xfs /dev/rbd0
meta-data=/dev/rbd0              isize=512    agcount=16, agsize=360448 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=5767168, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2816, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
# mkdir /mnt/test
# mount -t xfs /dev/rbd0 /mnt/test
# df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0        22G  191M   22G   1% /mnt/test
```

Then I write 10 GiB there:

```
# dd if=/dev/zero of=/mnt/test/10G bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 25.6991 s, 418 MB/s
# df -h /mnt/test/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0        22G   11G   12G  47% /mnt/test
# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    192 GiB  151 GiB  41 GiB    41 GiB      21.26
TOTAL  192 GiB  151 GiB  41 GiB    41 GiB      21.26

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B       17      0 B      0     17 GiB
.rgw.root               2   32  1.3 KiB        4   64 KiB      0     17 GiB
default.rgw.log         3   32  3.6 KiB      209  544 KiB      0     17 GiB
default.rgw.control     4   32      0 B        8      0 B      0     17 GiB
default.rgw.meta        5   32    382 B        2   32 KiB      0     17 GiB
rbdpool                 6   32      0 B        0      0 B      0     17 GiB
cephfs.cephfs.meta      7   32  2.3 KiB       22  128 KiB      0     17 GiB
cephfs.cephfs.data      8   32      0 B        0      0 B      0     17 GiB
rbdtest                 9   32   10 GiB    2.58k   40 GiB  37.54     17 GiB
```

And we see that after writing 10 GiB to the rbdtest pool, MAX AVAIL is 6 GiB
smaller. Which is a bit weird; I would understand if the decrease was smaller
than the stored data, but not vice versa. We can also see from the RAW USED
value that the cluster actually uses 4-way replication (10 GiB x 4).
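The 4-way replication can also be confirmed from the pool settings rather than inferred from RAW USED; a minimal sketch, where the expected outputs are my assumption rather than captured output:

```
# Hedged check: confirm the replica count and the CRUSH rule assigned to the pool.
ceph osd pool get rbdtest size        # expected: size: 4
ceph osd pool get rbdtest crush_rule  # expected: crush_rule: stretch_rule
ceph osd crush rule dump stretch_rule # inspect the rule steps
```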
When I utilize the testing rbd volume 100%:

```
# dd if=/dev/zero of=/mnt/test/full bs=1G count=20
dd: error writing '/mnt/test/full': No space left on device
12+0 records in
11+0 records out
12673089536 bytes (13 GB, 12 GiB) copied, 37.1448 s, 341 MB/s
```

I see that MAX AVAIL now reports 8.5 GiB available:

```
# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    192 GiB  102 GiB  90 GiB    90 GiB      46.94
TOTAL  192 GiB  102 GiB  90 GiB    90 GiB      46.94

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B       17      0 B      0    8.5 GiB
.rgw.root               2   32  1.3 KiB        4   64 KiB      0    8.5 GiB
default.rgw.log         3   32  3.6 KiB      209  544 KiB      0    8.5 GiB
default.rgw.control     4   32      0 B        8      0 B      0    8.5 GiB
default.rgw.meta        5   32    382 B        2   32 KiB      0    8.5 GiB
rbdpool                 6   32      0 B        0      0 B      0    8.5 GiB
cephfs.cephfs.meta      7   32  2.3 KiB       22  128 KiB      0    8.5 GiB
cephfs.cephfs.data      8   32      0 B        0      0 B      0    8.5 GiB
rbdtest                 9  128   22 GiB    5.59k   89 GiB  72.25    8.6 GiB
```

Which is good, as it means that we can actually utilize the cluster as
expected; we just can't rely on the value of MAX AVAIL.
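To put those numbers side by side, a back-of-the-envelope check (my own arithmetic, not output from the cluster):

```
# 22 GiB stored in rbdtest with 4 replicas ~= 88 GiB raw, matching the ~89-90 GiB RAW USED above.
echo $((22 * 4))   # 88
# With 192 GiB of raw storage and 4-way replication, usable capacity should be ~48 GiB.
echo $((192 / 4))  # 48
# At this point the cluster reports only 22 GiB stored + ~8.5 GiB MAX AVAIL ~= 30 GiB,
# so MAX AVAIL keeps underreporting the remaining capacity.
```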
Created attachment 1892557 [details] cluster crushmap
Created attachment 1892558 [details] cluster-stretchsetup.sh script
Created attachment 1892561 [details] ceph osd dump
Created attachment 1892562 [details] cluster post install script
This has an impact on ODF UI[1] (commit 21905d44f4), since I see that we use
the ceph_pool_max_avail metric in 2 cases:

```
$ rg --context 1 ceph_pool_max_avail packages/ocs/queries/ceph-storage.ts
62-  [StorageDashboardQuery.CEPH_CAPACITY_AVAILABLE]:
63:    'max(ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~"(.*file.*)|(.*block.*)"})',
64-};
--
196-  [StorageDashboardQuery.POOL_RAW_CAPACITY_USED]: `ceph_pool_bytes_used * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
197:  [StorageDashboardQuery.POOL_MAX_CAPACITY_AVAILABLE]: `ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
198-  [StorageDashboardQuery.POOL_UTILIZATION_IOPS_QUERY]: `(rate(ceph_pool_wr[1m]) + rate(ceph_pool_rd[1m])) * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
```

[1] https://github.com/red-hat-storage/odf-console
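For anyone debugging this on the monitoring side, the same metric can also be read straight from the ceph-mgr prometheus module, bypassing the console. A minimal sketch, assuming the prometheus mgr module is enabled on its default port; the mgr hostname is a placeholder:

```
# Dump the ceph_pool_max_avail samples as the dashboard would see them.
curl -s http://<active-mgr-host>:9283/metrics | grep '^ceph_pool_max_avail'
```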
There seems to be no impact on ODF alerting.
This will make most of metrics/alerting tests from ocs-ci fail/timeout, as these tests rely on MAX AVAIL value (to figure out how much data to write to get cluster utilization to a given level).
@gfarnum are you still the expert on the stretch cluster?
These stats are generated by the PGMap in the mgr. I took a brief look and am not sure what's causing this, though I imagine it's something about the two CRUSH roots and two "take" clauses in the CRUSH rule? Handing it off to Neha and the RADOS team.
Had a discussion with Prashant and he will take a look.
When I deploy an ODF cluster in arbiter mode (so that the stretched ceph
setup is fully managed by ODF), I don't see this issue:

```
bash-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    192 GiB  191 GiB  638 MiB   638 MiB       0.32
TOTAL  192 GiB  191 GiB  638 MiB   638 MiB       0.32

--- POOLS ---
POOL                                                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                   1    1      0 B        0      0 B      0     41 GiB
ocs-storagecluster-cephblockpool                        2  256   66 MiB       70  264 MiB   0.16     41 GiB
.rgw.root                                               3   32  4.8 KiB       16  240 KiB      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    4   32      0 B       22      0 B      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.control          5   32      0 B        8      0 B      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.meta             6   32  3.9 KiB       16  224 KiB      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.log              7   32   23 KiB      308  2.5 MiB      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   8   32      0 B        0      0 B      0     41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data     9   32    1 KiB        1   16 KiB      0     41 GiB
ocs-storagecluster-cephfilesystem-metadata             10   32  2.3 KiB       22  128 KiB      0     41 GiB
ocs-storagecluster-cephfilesystem-data0                11   32      0 B        0      0 B      0     41 GiB
```

This is with ODF 4.11.0-113, which includes ceph version 16.2.8-65.el8cp
(79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable).

So maybe we have a problem in the installation instructions? I will compare
the crush rules.
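For that comparison, dumping the decompiled CRUSH map from both clusters is probably the quickest route; a minimal sketch, with arbitrary file paths:

```
# Run on each cluster, then diff the two resulting text files.
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
ceph osd crush rule ls   # quick list of rule names
```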
Hi Martin,

The problem is not with the stretch mode cluster but rather with the crush
rule stretch_rule:

rule stretch_rule {
        id 1
        type replicated
        step take DC1
        step chooseleaf firstn 2 type host
        step emit
        step take DC2
        step chooseleaf firstn 2 type host
        step emit
}

If we change stretch_rule to the rule below, then "MAX AVAIL" shows the
correct value:

rule stretch_replicated_rule {
        id 2
        type replicated
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

With the way the crush rule stretch_rule is defined in your case,
PGMap::get_rule_avail considers only one datacenter's available size rather
than the total available size from both datacenters
(avail size = total-avail-size-from-both-dc / replication-size).

More details:

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -5, "item_name": "DC1" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" },
        { "op": "take", "item": -6, "item_name": "DC2" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "choose_firstn", "num": 0, "type": "datacenter" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    216 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    108 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB  15.60    108 GiB
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    216 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.14k   80 GiB   8.46    216 GiB

I will investigate the crush rule used by you further and see whether we need
to change the PGMap code to fix the way the available size is calculated for
the stretch_rule crush rule.
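As a possible interim workaround (my own suggestion, not something confirmed in this comment): an affected pool could be pointed at the single-take rule, which already reports the expected MAX AVAIL in the listing above. Note that changing a pool's rule may trigger data movement; pool and rule names are the ones from this comment:

```
# Hedged sketch: switch the pool to the single-take rule and re-check MAX AVAIL.
ceph osd pool set stretched_rbdpool crush_rule stretch_replicated_rule
ceph osd pool get stretched_rbdpool crush_rule   # verify the new rule
ceph df                                          # MAX AVAIL should roughly double for that pool
```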
Okay. CrushWrapper::get_rule_weight_osd_map needs a fix, as it has a clear
FIXME message on why it is ignoring the 2 takes from stretch_rule, which
place objects on 2 hosts from each DC.

int CrushWrapper::get_rule_weight_osd_map(unsigned ruleno,
                                          map<int,float> *pmap) const
{
  if (ruleno >= crush->max_rules)
    return -ENOENT;
  if (crush->rules[ruleno] == NULL)
    return -ENOENT;
  crush_rule *rule = crush->rules[ruleno];

  // build a weight map for each TAKE in the rule, and then merge them

  // FIXME: if there are multiple takes that place a different number of
  // objects we do not take that into account.  (Also, note that doing this
  // right is also a function of the pool, since the crush rule
  // might choose 2 + choose 2 but pool size may only be 3.)
  for (unsigned i=0; i<rule->len; ++i) {
    map<int,float> m;
    float sum = 0;
    if (rule->steps[i].op == CRUSH_RULE_TAKE) {
      int n = rule->steps[i].arg1;
      if (n >= 0) {
        m[n] = 1.0;
        sum = 1.0;
      } else {
        sum += _get_take_weight_osd_map(n, &m);
      }
    }
    _normalize_weight_map(sum, m, pmap);
  }

  return 0;
}
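To make the effect of that FIXME concrete with the numbers from Martin's cluster, here is a rough reading under one assumption (each take's weight map is normalized on its own, so every OSD ends up with roughly twice the weight fraction it would get if the whole rule were normalized at once); this is my own arithmetic, not a confirmed analysis:

```
# 12 OSDs of ~16 GiB each, 6 per datacenter, pool size 4.
echo $((16 * 6 / 4))    # 24 -> availability derived from a single DC subtree (~ the 23 GiB reported)
echo $((16 * 12 / 4))   # 48 -> availability derived from the whole tree (the expected value)
```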
(In reply to Prashant Dhange from comment #18)
> Okay. The CrushWrapper::get_rule_weight_osd_map needs fix as it has clear
> FIXME message on why it is ignoring 2 takes from stretch_rule which places
> objects on 2 hosts from each DC.
...
...

Good work, Prashant. The moment you confirm it is a bug and you think a fix
is needed, maybe clone an RHCS bug for this ODF bug.
Good catch with the stretched crush rule.

When I inspected the rules in the crush map on a stretched ceph cluster
managed by ODF, I noticed that the rules indeed differ:

```
# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule default_stretch_cluster_rule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type zone
        step chooseleaf firstn 2 type host
        step emit
}
```

Which explains why I don't see the problem there.

That said, I'm not sure why ODF is using a different stretch rule. In our
MetroDR guide for stretched ceph, we adopted the suggestion from upstream, if
I recall correctly:

https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
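For reference, this is roughly what the MetroDR instructions would have to do to move to the single-take shape of the rule, using the datacenter buckets from this bug; a hedged sketch of the usual crushtool round-trip, not a verified doc change:

```
# Export and decompile the current CRUSH map.
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# Edit crush.txt so the stretch rule takes the whole tree once, e.g.:
#   step take default
#   step choose firstn 0 type datacenter
#   step chooseleaf firstn 2 type host
#   step emit
# Recompile and inject the updated map.
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
```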
(In reply to Vikhyat Umrao from comment #19)
> (In reply to Prashant Dhange from comment #18)
> > Okay. The CrushWrapper::get_rule_weight_osd_map needs fix as it has clear
> > FIXME message on why it is ignoring 2 takes from stretch_rule which places
> > objects on 2 hosts from each DC.
...
...
> Good work, Prashant. The moment you confirm it is a bug and you think a fix
> is needed maybe clone an RHCS bug for this ODF bug.

I have cloned this BZ to RHCS bug
https://bugzilla.redhat.com/show_bug.cgi?id=2109129
(In reply to Martin Bukatovic from comment #20)
> Good catch with the stretched crush rule.
>
> When I inspected rules in crush map on a stretched ceph cluster managed by
> ODF, I noticed that indeed the rules differ:
...
...
> That said I'm not sure why ODF is using different stretch rule. In our
> MetroDR guide related to stretched ceph, we adopted the suggestion from
> upstream if I recall right:
>
> https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

Thanks Martin. I have opened upstream PR#47189 to fix this inconsistency
(more details on RHCS bug BZ#2109129).
Not a 4.11 blocker
Ceph BZ is targeted for 6.1