Bug 2100920
Summary: [MetroDR] ceph df reports invalid MAX AVAIL value when the cluster is in stretch mode
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.11
Reporter: Martin Bukatovic <mbukatov>
Assignee: Prashant Dhange <pdhange>
QA Contact: Elad <ebenahar>
Status: ASSIGNED
Severity: medium
Priority: unspecified
CC: bniver, gfarnum, muagarwa, odf-bz-bot, olakra, pdhange, pdhiran, rzarzyns, sheggodu, vumrao
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Known Issue
Doc Text:
.`ceph df` reports invalid MAX AVAIL value when the cluster is in stretch mode
When a CRUSH rule for an RHCS cluster has multiple "take" steps, the `ceph df` report shows the wrong maximum available size for the affected pools. The issue will be fixed in an upcoming release.
Cloned to: 2109129 (view as bug list)
Bug Depends On: 2109129
Description
Martin Bukatovic
2022-06-24 17:01:10 UTC
Created attachment 1892557 [details]
cluster crushmap
Created attachment 1892558 [details]
cluster-stretchsetup.sh script
Created attachment 1892561 [details]
ceph osd dump
Created attachment 1892562 [details]
cluster post install script
This has an impact on the ODF UI [1] (commit 21905d44f4), since I see that we use the ceph_pool_max_avail metric in 2 cases:

```
$ rg --context 1 ceph_pool_max_avail packages/ocs/queries/ceph-storage.ts
62-  [StorageDashboardQuery.CEPH_CAPACITY_AVAILABLE]:
63:    'max(ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~"(.*file.*)|(.*block.*)"})',
64-};
--
196-  [StorageDashboardQuery.POOL_RAW_CAPACITY_USED]: `ceph_pool_bytes_used * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
197:  [StorageDashboardQuery.POOL_MAX_CAPACITY_AVAILABLE]: `ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
198-  [StorageDashboardQuery.POOL_UTILIZATION_IOPS_QUERY]: `(rate(ceph_pool_wr[1m]) + rate(ceph_pool_rd[1m])) * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
```

[1] https://github.com/red-hat-storage/odf-console

There seems to be no impact on ODF alerting.

This will make most of the metrics/alerting tests from ocs-ci fail or time out, as these tests rely on the MAX AVAIL value (to figure out how much data to write to get cluster utilization to a given level).

@gfarnum are you still the expert on the stretch cluster?

These stats are generated by the PGMap in the mgr. I took a brief look and am not sure what's causing this, though I imagine it's something about the two CRUSH roots and two "take" clauses in the CRUSH rule? Handing it off to Neha and the RADOS team.

Had a discussion with Prashant and he will take a look.
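The ocs-ci dependence on MAX AVAIL mentioned above can be sketched in a few lines. This is a hedged illustration, not actual ocs-ci code; the helper name `bytes_to_reach_utilization` and the simplified capacity model (pool capacity roughly equals used plus MAX AVAIL) are assumptions made for the example:

```python
# Hedged sketch of how a utilization-targeting test depends on MAX AVAIL.
# `bytes_to_reach_utilization` is a hypothetical helper, not actual
# ocs-ci code; it assumes pool capacity ~= used + MAX AVAIL.

def bytes_to_reach_utilization(target: float, used: float, max_avail: float) -> float:
    """Return how much data (same unit as inputs) to write so that
    used / (used + avail) reaches `target`."""
    total = used + max_avail  # approximate usable pool capacity
    return max(0.0, target * total - used)

# Using figures that appear later in this report: a pool with 20 GiB
# stored, where the correct MAX AVAIL is 216 GiB but the buggy report
# shows 108 GiB. A test aiming for 75% utilization would compute:
correct = bytes_to_reach_utilization(0.75, used=20, max_avail=216)  # 157.0 GiB
halved = bytes_to_reach_utilization(0.75, used=20, max_avail=108)   # 76.0 GiB
print(correct, halved)
```

With a halved MAX AVAIL, the computed write size differs by roughly a factor of two, so tests that drive the cluster to a target utilization either miss the target or time out waiting for it.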
When I deploy an ODF cluster in arbiter mode (so that the stretched Ceph setup is fully managed by ODF), I don't see this issue:

```
bash-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    192 GiB  191 GiB  638 MiB  638 MiB   0.32
TOTAL  192 GiB  191 GiB  638 MiB  638 MiB   0.32

--- POOLS ---
POOL                                                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                  1   1    0 B      0        0 B      0      41 GiB
ocs-storagecluster-cephblockpool                       2   256  66 MiB   70       264 MiB  0.16   41 GiB
.rgw.root                                              3   32   4.8 KiB  16       240 KiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index   4   32   0 B      22       0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.control         5   32   0 B      8        0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.meta            6   32   3.9 KiB  16       224 KiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.log             7   32   23 KiB   308      2.5 MiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec  8   32   0 B      0        0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    9   32   1 KiB    1        16 KiB   0      41 GiB
ocs-storagecluster-cephfilesystem-metadata             10  32   2.3 KiB  22       128 KiB  0      41 GiB
ocs-storagecluster-cephfilesystem-data0                11  32   0 B      0        0 B      0      41 GiB
```

This is with ODF 4.11.0-113, which includes ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable). So maybe we have a problem in the installation instructions? I will compare the crush rules.

Hi Martin,

The problem is not with the stretch mode cluster but rather with the crush rule stretch_rule.
```
rule stretch_rule {
    id 1
    type replicated
    step take DC1
    step chooseleaf firstn 2 type host
    step emit
    step take DC2
    step chooseleaf firstn 2 type host
    step emit
}
```

If we change stretch_rule to the rule below, then "MAX AVAIL" shows the correct value:

```
rule stretch_replicated_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
```

The way crush rule stretch_rule is defined in your case, PGMap::get_rule_avail is considering only one datacenter's available size rather than the total avail size from both datacenters (avail size = total-avail-size-from-both-dc / replication-size).

More details:

```
$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -5, "item_name": "DC1" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" },
        { "op": "take", "item": -6, "item_name": "DC2" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "choose_firstn", "num": 0, "type": "datacenter" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB  252 GiB   20.81
TOTAL  1.2 TiB  960 GiB  252 GiB  252 GiB   20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                          1   1    1.5 MiB  2        4.5 MiB  0      289 GiB
cephfs.a.meta                 2   16   2.3 KiB  22       96 KiB   0      289 GiB
cephfs.a.data                 3   32   0 B      0        0 B      0      289 GiB
rbdpool                       4   32   0 B      0        0 B      0      216 GiB
rbdtest                       5   32   20 GiB   5.14k    80 GiB   8.46   216 GiB
stretched_rbdpool             6   32   0 B      0        0 B      0      108 GiB
stretched_rbdtest             7   32   20 GiB   5.14k    80 GiB   15.60  108 GiB
stretched_replicated_rbdpool  8   32   0 B      0        0 B      0      216 GiB
stretched_replicated_rbdtest  9   32   20 GiB   5.14k    80 GiB   8.46   216 GiB
```

I will investigate the crush rule used by you further and see if we need to change PGMap
code to fix the way available size is getting calculated for crush rule stretch_rule.

Okay. CrushWrapper::get_rule_weight_osd_map needs a fix; it has a clear FIXME message about why it mishandles the 2 takes in stretch_rule, which place objects on 2 hosts in each DC:

```
int CrushWrapper::get_rule_weight_osd_map(unsigned ruleno,
                                          map<int,float> *pmap) const
{
  if (ruleno >= crush->max_rules)
    return -ENOENT;
  if (crush->rules[ruleno] == NULL)
    return -ENOENT;
  crush_rule *rule = crush->rules[ruleno];

  // build a weight map for each TAKE in the rule, and then merge them

  // FIXME: if there are multiple takes that place a different number of
  // objects we do not take that into account.  (Also, note that doing this
  // right is also a function of the pool, since the crush rule
  // might choose 2 + choose 2 but pool size may only be 3.)
  for (unsigned i=0; i<rule->len; ++i) {
    map<int,float> m;
    float sum = 0;
    if (rule->steps[i].op == CRUSH_RULE_TAKE) {
      int n = rule->steps[i].arg1;
      if (n >= 0) {
        m[n] = 1.0;
        sum = 1.0;
      } else {
        sum += _get_take_weight_osd_map(n, &m);
      }
    }
    _normalize_weight_map(sum, m, pmap);
  }

  return 0;
}
```

(In reply to Prashant Dhange from comment #18)
> Okay. The CrushWrapper::get_rule_weight_osd_map needs fix as it has clear
> FIXME message on why it is ignoring 2 takes from stretch_rule which places
> objects on 2 hosts from each DC.
...

Good work, Prashant. The moment you confirm it is a bug and you think a fix is needed, maybe clone an RHCS bug for this ODF bug.

Good catch with the stretched crush rule.

When I inspected the rules in the crush map on a stretched Ceph cluster managed by ODF, I noticed that the rules indeed differ:

```
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule default_stretch_cluster_rule {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type zone
    step chooseleaf firstn 2 type host
    step emit
}
```

Which explains why I don't see the problem there.

That said, I'm not sure why ODF is using a different stretch rule. In our MetroDR guide related to stretched Ceph, we adopted the suggestion from upstream, if I recall right:

https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

(In reply to Vikhyat Umrao from comment #19)
> Good work, Prashant. The moment you confirm it is a bug and you think a fix
> is needed maybe clone an RHCS bug for this ODF bug.

I have cloned this BZ to RHCS bug https://bugzilla.redhat.com/show_bug.cgi?id=2109129

(In reply to Martin Bukatovic from comment #20)
> Good catch with the stretched crush rule.
> When I inspected rules in crush map on a stretched ceph cluster managed by
> ODF, I noticed that indeed the rules differ:
...
> https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

Thanks Martin. I have opened upstream PR#47189 to fix this inconsistency (more details on RHCS bug BZ#2109129).

Not a 4.11 blocker.

The Ceph BZ is targeted for 6.1.
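The mis-merge identified in CrushWrapper::get_rule_weight_osd_map can be shown with a toy numeric model: if each "take" subtree's weight map is normalized to 1 on its own, a rule with two takes yields weights summing to 2, and an avail estimate of the form min(osd_avail / weight) / pool_size comes out at half the correct value. This is a simplified sketch (the merge and avail formulas are rough approximations of the CrushWrapper/PGMap behavior, and the OSD sizes are illustrative), not the actual Ceph code:

```python
# Toy model of the weight-map merge for a multi-take CRUSH rule.
# Assumption: the avail estimate is roughly
#     min over OSDs of (osd_avail / weight) / pool_size,
# where the rule's weight map is expected to sum to 1.

def merge_per_take(takes):
    """Buggy-style merge: each 'take' subtree is normalized to 1 on its
    own, so a rule with two takes produces weights summing to 2."""
    pmap = {}
    for take in takes:
        total = sum(take.values())
        for osd, w in take.items():
            pmap[osd] = pmap.get(osd, 0.0) + w / total
    return pmap

def merge_whole_rule(takes):
    """Merge all takes first, then normalize once over the union, so the
    final weights sum to 1."""
    raw = {}
    for take in takes:
        for osd, w in take.items():
            raw[osd] = raw.get(osd, 0.0) + w
    total = sum(raw.values())
    return {osd: w / total for osd, w in raw.items()}

def max_avail(weights, osd_avail, pool_size):
    return min(osd_avail[o] / w for o, w in weights.items()) / pool_size

# Two symmetric datacenters, 3 equal-weight OSDs each, 160 GiB free per
# OSD (illustrative numbers, not taken from this cluster).
dc1 = {f"osd.{i}": 1.0 for i in range(3)}
dc2 = {f"osd.{i}": 1.0 for i in range(3, 6)}
avail = {o: 160 for o in list(dc1) + list(dc2)}

buggy = max_avail(merge_per_take([dc1, dc2]), avail, pool_size=4)
fixed = max_avail(merge_whole_rule([dc1, dc2]), avail, pool_size=4)
print(buggy, fixed)  # buggy comes out at half of fixed
```

This matches the symptom in the ceph df output above: pools on the two-take stretch_rule report 108 GiB MAX AVAIL, exactly half of the 216 GiB reported for the single-take stretch_replicated_rule at the same pool size.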