Bug 2100920
Summary: [MetroDR] ceph df reports invalid MAX AVAIL value when the cluster is in stretch mode
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.11
Reporter: Martin Bukatovic <mbukatov>
Assignee: Prashant Dhange <pdhange>
QA Contact: Elad <ebenahar>
Status: ASSIGNED
Severity: medium
Priority: unspecified
CC: bniver, gfarnum, muagarwa, odf-bz-bot, olakra, pdhange, pdhiran, rzarzyns, sheggodu, vumrao
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Known Issue
Doc Text:
.`ceph df` reports invalid MAX AVAIL value when the cluster is in stretch mode
When a CRUSH rule for an RHCS cluster has multiple "take" steps, the `ceph df` report shows the wrong maximum available size for the affected pools. The issue will be fixed in an upcoming release.
Cloned to: 2109129 (view as bug list)
Bug Depends On: 2109129
Description
Martin Bukatovic
2022-06-24 17:01:10 UTC
Created attachment 1892557 [details]
cluster crushmap
Created attachment 1892558 [details]
cluster-stretchsetup.sh script
Created attachment 1892561 [details]
ceph osd dump
Created attachment 1892562 [details]
cluster post install script
This has an impact on the ODF UI [1] (commit 21905d44f4), since I see that we use the ceph_pool_max_avail metric in 2 cases:

```
$ rg --context 1 ceph_pool_max_avail packages/ocs/queries/ceph-storage.ts
62-  [StorageDashboardQuery.CEPH_CAPACITY_AVAILABLE]:
63:    'max(ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~"(.*file.*)|(.*block.*)"})',
64-};
--
196-  [StorageDashboardQuery.POOL_RAW_CAPACITY_USED]: `ceph_pool_bytes_used * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
197:  [StorageDashboardQuery.POOL_MAX_CAPACITY_AVAILABLE]: `ceph_pool_max_avail * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
198-  [StorageDashboardQuery.POOL_UTILIZATION_IOPS_QUERY]: `(rate(ceph_pool_wr[1m]) + rate(ceph_pool_rd[1m])) * on (pool_id) group_left(name)ceph_pool_metadata{name=~'${names}'}`,
```

[1] https://github.com/red-hat-storage/odf-console

There seems to be no impact on ODF alerting.

This will make most of the metrics/alerting tests from ocs-ci fail or time out, as these tests rely on the MAX AVAIL value (to figure out how much data to write to get cluster utilization to a given level).

@gfarnum are you still the expert on the stretch cluster?

These stats are generated by the PGMap in the mgr. I took a brief look and am not sure what's causing this, though I imagine it's something about the two CRUSH roots and two "take" clauses in the CRUSH rule? Handing it off to Neha and the RADOS team.

Had a discussion with Prashant and he will take a look.
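The ocs-ci dependence on MAX AVAIL mentioned above can be sketched in a few lines. This is a hedged illustration, not actual ocs-ci code; the helper name `bytes_to_reach_utilization` and the simplified capacity model (pool capacity roughly equals used plus MAX AVAIL) are assumptions made for the example:

```python
# Hedged sketch of how a utilization-targeting test depends on MAX AVAIL.
# `bytes_to_reach_utilization` is a hypothetical helper, not actual
# ocs-ci code; it assumes pool capacity ~= used + MAX AVAIL.

def bytes_to_reach_utilization(target: float, used: float, max_avail: float) -> float:
    """Return how much data (same unit as inputs) to write so that
    used / (used + avail) reaches `target`."""
    total = used + max_avail  # approximate usable pool capacity
    return max(0.0, target * total - used)

# Using figures that appear later in this report: a pool with 20 GiB
# stored, where the correct MAX AVAIL is 216 GiB but the buggy report
# shows 108 GiB. A test aiming for 75% utilization would compute:
correct = bytes_to_reach_utilization(0.75, used=20, max_avail=216)  # 157.0 GiB
halved = bytes_to_reach_utilization(0.75, used=20, max_avail=108)   # 76.0 GiB
print(correct, halved)
```

With a halved MAX AVAIL, the computed write size differs by roughly a factor of two, so tests that drive the cluster to a target utilization either miss the target or time out waiting for it.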
When I deploy an ODF cluster in arbiter mode (so that the stretched Ceph setup is fully managed by ODF), I don't see this issue:

```
bash-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    192 GiB  191 GiB  638 MiB  638 MiB   0.32
TOTAL  192 GiB  191 GiB  638 MiB  638 MiB   0.32

--- POOLS ---
POOL                                                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                  1   1    0 B      0        0 B      0      41 GiB
ocs-storagecluster-cephblockpool                       2   256  66 MiB   70       264 MiB  0.16   41 GiB
.rgw.root                                              3   32   4.8 KiB  16       240 KiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index   4   32   0 B      22       0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.control         5   32   0 B      8        0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.meta            6   32   3.9 KiB  16       224 KiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.log             7   32   23 KiB   308      2.5 MiB  0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec  8   32   0 B      0        0 B      0      41 GiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    9   32   1 KiB    1        16 KiB   0      41 GiB
ocs-storagecluster-cephfilesystem-metadata             10  32   2.3 KiB  22       128 KiB  0      41 GiB
ocs-storagecluster-cephfilesystem-data0                11  32   0 B      0        0 B      0      41 GiB
```

This is with ODF 4.11.0-113, which includes ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable). So maybe we have a problem in the installation instructions? I will compare the crush rules.

Hi Martin,

The problem is not with the stretch mode cluster but rather with the crush rule stretch_rule.
```
rule stretch_rule {
    id 1
    type replicated
    step take DC1
    step chooseleaf firstn 2 type host
    step emit
    step take DC2
    step chooseleaf firstn 2 type host
    step emit
}
```

If we change stretch_rule to the rule below, then "MAX AVAIL" shows the correct value:

```
rule stretch_replicated_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
```

The way crush rule stretch_rule is defined in your case, PGMap::get_rule_avail is considering only one datacenter's available size rather than the total avail size from both datacenters (avail size = total-avail-size-from-both-dc / replication-size).

More details:

```
$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -5, "item_name": "DC1" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" },
        { "op": "take", "item": -6, "item_name": "DC2" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        { "op": "take", "item": -1, "item_name": "default" },
        { "op": "choose_firstn", "num": 0, "type": "datacenter" },
        { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
        { "op": "emit" }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB  252 GiB   20.81
TOTAL  1.2 TiB  960 GiB  252 GiB  252 GiB   20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                          1   1    1.5 MiB  2        4.5 MiB  0      289 GiB
cephfs.a.meta                 2   16   2.3 KiB  22       96 KiB   0      289 GiB
cephfs.a.data                 3   32   0 B      0        0 B      0      289 GiB
rbdpool                       4   32   0 B      0        0 B      0      216 GiB
rbdtest                       5   32   20 GiB   5.14k    80 GiB   8.46   216 GiB
stretched_rbdpool             6   32   0 B      0        0 B      0      108 GiB
stretched_rbdtest             7   32   20 GiB   5.14k    80 GiB   15.60  108 GiB
stretched_replicated_rbdpool  8   32   0 B      0        0 B      0      216 GiB
stretched_replicated_rbdtest  9   32   20 GiB   5.14k    80 GiB   8.46   216 GiB
```

I will investigate the crush rule used by you further and see if we need to change PGMap
code to fix the way available size is getting calculated for crush rule stretch_rule.

Okay. CrushWrapper::get_rule_weight_osd_map needs a fix; it has a clear FIXME message about why it mishandles the 2 takes in stretch_rule, which place objects on 2 hosts in each DC:

```
int CrushWrapper::get_rule_weight_osd_map(unsigned ruleno,
                                          map<int,float> *pmap) const
{
  if (ruleno >= crush->max_rules)
    return -ENOENT;
  if (crush->rules[ruleno] == NULL)
    return -ENOENT;
  crush_rule *rule = crush->rules[ruleno];

  // build a weight map for each TAKE in the rule, and then merge them

  // FIXME: if there are multiple takes that place a different number of
  // objects we do not take that into account.  (Also, note that doing this
  // right is also a function of the pool, since the crush rule
  // might choose 2 + choose 2 but pool size may only be 3.)
  for (unsigned i=0; i<rule->len; ++i) {
    map<int,float> m;
    float sum = 0;
    if (rule->steps[i].op == CRUSH_RULE_TAKE) {
      int n = rule->steps[i].arg1;
      if (n >= 0) {
        m[n] = 1.0;
        sum = 1.0;
      } else {
        sum += _get_take_weight_osd_map(n, &m);
      }
    }
    _normalize_weight_map(sum, m, pmap);
  }

  return 0;
}
```

(In reply to Prashant Dhange from comment #18)
> Okay. The CrushWrapper::get_rule_weight_osd_map needs fix as it has clear
> FIXME message on why it is ignoring 2 takes from stretch_rule which places
> objects on 2 hosts from each DC.
...

Good work, Prashant. The moment you confirm it is a bug and you think a fix is needed, maybe clone an RHCS bug for this ODF bug.

Good catch with the stretched crush rule.

When I inspected the rules in the crush map on a stretched Ceph cluster managed by ODF, I noticed that the rules indeed differ:

```
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule default_stretch_cluster_rule {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type zone
    step chooseleaf firstn 2 type host
    step emit
}
```

Which explains why I don't see the problem there.

That said, I'm not sure why ODF is using a different stretch rule. In our MetroDR guide related to stretched Ceph, we adopted the suggestion from upstream, if I recall right:

https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

(In reply to Vikhyat Umrao from comment #19)
> Good work, Prashant. The moment you confirm it is a bug and you think a fix
> is needed maybe clone an RHCS bug for this ODF bug.

I have cloned this BZ to RHCS bug https://bugzilla.redhat.com/show_bug.cgi?id=2109129

(In reply to Martin Bukatovic from comment #20)
> Good catch with the stretched crush rule.
> When I inspected rules in crush map on a stretched ceph cluster managed by
> ODF, I noticed that indeed the rules differ:
...
> https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

Thanks Martin. I have opened upstream PR#47189 to fix this inconsistency (more details on RHCS bug BZ#2109129).

Not a 4.11 blocker.

The Ceph BZ is targeted for 6.1.
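The mis-merge identified in CrushWrapper::get_rule_weight_osd_map can be shown with a toy numeric model: if each "take" subtree's weight map is normalized to 1 on its own, a rule with two takes yields weights summing to 2, and an avail estimate of the form min(osd_avail / weight) / pool_size comes out at half the correct value. This is a simplified sketch (the merge and avail formulas are rough approximations of the CrushWrapper/PGMap behavior, and the OSD sizes are illustrative), not the actual Ceph code:

```python
# Toy model of the weight-map merge for a multi-take CRUSH rule.
# Assumption: the avail estimate is roughly
#     min over OSDs of (osd_avail / weight) / pool_size,
# where the rule's weight map is expected to sum to 1.

def merge_per_take(takes):
    """Buggy-style merge: each 'take' subtree is normalized to 1 on its
    own, so a rule with two takes produces weights summing to 2."""
    pmap = {}
    for take in takes:
        total = sum(take.values())
        for osd, w in take.items():
            pmap[osd] = pmap.get(osd, 0.0) + w / total
    return pmap

def merge_whole_rule(takes):
    """Merge all takes first, then normalize once over the union, so the
    final weights sum to 1."""
    raw = {}
    for take in takes:
        for osd, w in take.items():
            raw[osd] = raw.get(osd, 0.0) + w
    total = sum(raw.values())
    return {osd: w / total for osd, w in raw.items()}

def max_avail(weights, osd_avail, pool_size):
    return min(osd_avail[o] / w for o, w in weights.items()) / pool_size

# Two symmetric datacenters, 3 equal-weight OSDs each, 160 GiB free per
# OSD (illustrative numbers, not taken from this cluster).
dc1 = {f"osd.{i}": 1.0 for i in range(3)}
dc2 = {f"osd.{i}": 1.0 for i in range(3, 6)}
avail = {o: 160 for o in list(dc1) + list(dc2)}

buggy = max_avail(merge_per_take([dc1, dc2]), avail, pool_size=4)
fixed = max_avail(merge_whole_rule([dc1, dc2]), avail, pool_size=4)
print(buggy, fixed)  # buggy comes out at half of fixed
```

This matches the symptom in the ceph df output above: pools on the two-take stretch_rule report 108 GiB MAX AVAIL, exactly half of the 216 GiB reported for the single-take stretch_replicated_rule at the same pool size.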