Bug 2219229

Summary: PGs are not being autoscaled to desired levels when the cluster has OSDs of multiple device classes with Custom CRUSH rules
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Pawan <pdhiran>
Component: RADOS    Assignee: Kamoltat (Junior) Sirivadhna <ksirivad>
Status: NEW --- QA Contact: Pawan <pdhiran>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.1    CC: bhubbard, ceph-eng-bugs, cephqe-warriors, dparkes, nojha, sostapov, vimishra, vumrao
Target Milestone: ---   
Target Release: 6.1z2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:

Description Pawan 2023-07-03 04:28:10 UTC
Description of problem:
PGs are not being autoscaled to the desired levels when the cluster has OSDs of multiple device classes and pools that use custom CRUSH rules. The affected pools remain at pg_num 1, as shown below:

pool 10 'ecpool-42-new' erasure profile ec42 size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 280 flags hashpspool stripe_width 16384 application rados
pool 11 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 432 flags hashpspool stripe_width 0 application rados
pool 12 'test2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 437 flags hashpspool stripe_width 0 application rados
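
For reference, the CRUSH rules referenced above (crush_rule 0 by the replicated pools, crush_rule 2 by ecpool-42-new) can be listed and dumped on the cluster with the standard CRUSH rule commands; their output is omitted here:

# ceph osd crush rule ls
# ceph osd crush rule dump -f json-pretty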

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                                     STATUS  REWEIGHT  PRI-AFF
-1         0.78070  root default
-7         0.19518      host ceph-pdhiran-rnxgw8-node1-installer
19    hdd  0.02440          osd.19                                    up   1.00000  1.00000
23    hdd  0.02440          osd.23                                    up   1.00000  1.00000
26    hdd  0.02440          osd.26                                    up   1.00000  1.00000
30    hdd  0.02440          osd.30                                    up   1.00000  1.00000
 3    ssd  0.02440          osd.3                                     up   1.00000  1.00000
 7    ssd  0.02440          osd.7                                     up   1.00000  1.00000
11    ssd  0.02440          osd.11                                    up   1.00000  1.00000
15    ssd  0.02440          osd.15                                    up   1.00000  1.00000
-3         0.19518      host ceph-pdhiran-rnxgw8-node2
16    hdd  0.02440          osd.16                                    up   1.00000  1.00000
20    hdd  0.02440          osd.20                                    up   1.00000  1.00000
24    hdd  0.02440          osd.24                                    up   1.00000  1.00000
28    hdd  0.02440          osd.28                                    up   1.00000  1.00000
 0    ssd  0.02440          osd.0                                     up   1.00000  1.00000
 4    ssd  0.02440          osd.4                                     up   1.00000  1.00000
 8    ssd  0.02440          osd.8                                     up   1.00000  1.00000
12    ssd  0.02440          osd.12                                    up   1.00000  1.00000
-9         0.19518      host ceph-pdhiran-rnxgw8-node3
18    hdd  0.02440          osd.18                                    up   1.00000  1.00000
22    hdd  0.02440          osd.22                                    up   1.00000  1.00000
27    hdd  0.02440          osd.27                                    up   1.00000  1.00000
31    hdd  0.02440          osd.31                                    up   1.00000  1.00000
 2    ssd  0.02440          osd.2                                     up   1.00000  1.00000
 6    ssd  0.02440          osd.6                                     up   1.00000  1.00000
10    ssd  0.02440          osd.10                                    up   1.00000  1.00000
14    ssd  0.02440          osd.14                                    up   1.00000  1.00000
-5         0.19518      host ceph-pdhiran-rnxgw8-node4
17    hdd  0.02440          osd.17                                    up   1.00000  1.00000
21    hdd  0.02440          osd.21                                    up   1.00000  1.00000
25    hdd  0.02440          osd.25                                    up   1.00000  1.00000
29    hdd  0.02440          osd.29                                    up   1.00000  1.00000
 1    ssd  0.02440          osd.1                                     up   1.00000  1.00000
 5    ssd  0.02440          osd.5                                     up   1.00000  1.00000
 9    ssd  0.02440          osd.9                                     up   1.00000  1.00000
13    ssd  0.02440          osd.13                                    up   1.00000  1.00000

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    400 GiB  390 GiB   10 GiB    10 GiB       2.53
ssd    400 GiB  394 GiB  6.0 GiB   6.0 GiB       1.50
TOTAL  800 GiB  784 GiB   16 GiB    16 GiB       2.02

--- POOLS ---
POOL                 ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1    1  449 KiB        2  1.3 MiB      0    211 GiB
cephfs.cephfs.meta    2   16  4.6 KiB       22   96 KiB      0    211 GiB
cephfs.cephfs.data    3   32      0 B        0      0 B      0    211 GiB
.rgw.root             4   32  1.3 KiB        4   48 KiB      0    211 GiB
default.rgw.log       5   32  3.6 KiB      209  408 KiB      0    211 GiB
default.rgw.control   6   32      0 B        8      0 B      0    211 GiB
default.rgw.meta      7   32    382 B        3   24 KiB      0    211 GiB
ecpool-21             8   32      0 B        0      0 B      0    211 GiB
ecpool-42-new        10    1      0 B        0      0 B      0    423 GiB
test                 11    1  3.7 GiB   39.31k   11 GiB   1.74    211 GiB
test2                12    1      0 B        0      0 B      0    211 GiB

Version-Release number of selected component (if applicable):
ceph version 17.2.6-76.el9cp (7d277f1e8500eb73e50260771e11b7bd7d6f34af) quincy (stable)

How reproducible:
Reproducible 3/3 times when creating pools on the same cluster.

Steps to Reproduce:
1. Deploy RHCS cluster, with all the services.
2. For some OSDs, remove the existing device class and then set 'ssd' as the device class (shown here for osd.12):
ceph osd crush rm-device-class 12
ceph osd crush set-device-class ssd 12

3. Create pools after this operation. PGs are not autoscaled to the desired levels, even after running I/O on the pools (see the reproduction sketch below).
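
A minimal script sketch of the reproduction, assuming a fresh cluster; the OSD IDs and pool name below are illustrative and should be adjusted as needed:

# Move a few OSDs from the hdd class to the ssd class (osd.12 is used above; add others as required).
for id in 12 13 14 15; do
    ceph osd crush rm-device-class $id
    ceph osd crush set-device-class ssd $id
done

# Create a pool afterwards and check whether the autoscaler picks it up.
ceph osd pool create test 1 1 replicated
ceph osd pool set test pg_autoscale_mode on
ceph osd pool autoscale-status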

Also, the output of autoscale-status is empty on the cluster. Not sure whether this deserves a separate bug.
# ceph osd pool autoscale-status -f json-pretty

[]
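
For further triage, the following standard commands may help narrow down why the autoscaler reports nothing (the noautoscale query assumes the Quincy-era global flag is available on this build; 'test' is the pool from this report):

# ceph health detail
# ceph mgr module ls
# ceph osd pool get test pg_autoscale_mode
# ceph osd pool get noautoscale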

[ceph: root@ceph-pdhiran-rnxgw8-node1-installer /]# ceph -s
  cluster:
    id:     0acee9a0-141b-11ee-b0ba-fa163eb5f775
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-pdhiran-rnxgw8-node1-installer,ceph-pdhiran-rnxgw8-node3,ceph-pdhiran-rnxgw8-node2 (age 6d)
    mgr: ceph-pdhiran-rnxgw8-node1-installer.ceobbf(active, since 6d), standbys: ceph-pdhiran-rnxgw8-node4.qpjdxs, ceph-pdhiran-rnxgw8-node2.zvsusw
    mds: 1/1 daemons up, 2 standby
    osd: 32 osds: 32 up (since 6d), 32 in (since 6d)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 212 pgs
    objects: 39.56k objects, 3.7 GiB
    usage:   16 GiB used, 784 GiB / 800 GiB avail
    pgs:     212 active+clean


CRUSH map on the cluster: http://pastebin.test.redhat.com/1103986
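
Since the pastebin host is internal, the decompiled CRUSH map can also be regenerated directly from the cluster with the usual getcrushmap/crushtool pair:

# ceph osd getcrushmap -o /tmp/crushmap.bin
# crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt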

Actual results:
PGs are not scaling up, and the autoscale-status command output is empty.

Expected results:
PGs should be scaled automatically

Additional info:

Comment 2 Scott Ostapovicz 2023-07-12 12:43:38 UTC
Missed the 6.1 z1 window.  Retargeting to 6.1 z2.