Bug 2302230

Summary: [Reads Balancer] PGs not getting scaled down post removal of bulk flag on the cluster
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Pawan <pdhiran>
Component: RADOS Assignee: Laura Flores <lflores>
Status: ASSIGNED --- QA Contact: Pawan <pdhiran>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.1CC: bhubbard, ceph-eng-bugs, cephqe-warriors, lflores, ngangadh, nojha, rpollack, rzarzyns, vumrao, yhatuka
Target Milestone: --- Keywords: Automation, TestBlocker
Target Release: 9.0 Flags: lflores: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
.Placement groups are not scaled down in `upmap-read` and `read` balancer modes
Currently, `pg_upmap_primary` entries are not properly removed for placement groups (PGs) that are pending merge, for example, when the bulk flag is removed on a pool, or in any other case where the number of PGs in a pool decreases. As a result, the PG scale-down process gets stuck and the number of PGs in the affected pool does not decrease as expected. As a workaround, remove the `pg_upmap_primary` entries in the OSD map of the affected pool: run the `ceph osd dump` command to view the entries, and then run `ceph osd rm-pg-upmap-primary PG_ID` for each PG in the affected pool. After applying the workaround, the PG scale-down process resumes as expected.
Story Points: ---
Clone Of:
: 2357061 (view as bug list) Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2357061, 2317218    

Description Pawan 2024-08-01 10:04:31 UTC
Description of problem:
PGs are not being scaled down on the pool after the bulk flag is disabled on the pool.

The newly calculated PG count is displayed in ceph osd pool autoscale-status and ceph osd pool ls detail, but the actual scale-down of the PGs in the pool does not happen.

# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 37 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 20.00
pool 2 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 91 lfor 0/0/54 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.51
pool 3 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 73 lfor 0/0/71 flags hashpspool,bulk stripe_width 0 application cephfs read_balance_score 1.05
pool 8 'balancer_test_pool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 243 pgp_num 231 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 395 lfor 0/393/391 flags hashpspool stripe_width 0 application rados read_balance_score 1.32


[root@ceph-pdhiran-cdx69q-node1-installer ~]# ceph osd pool autoscale-status
POOL                  SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr                602.6k                3.0        499.9G  0.0000                                  1.0       1              on         False
cephfs.cephfs.meta  32768                 3.0        499.9G  0.0000                                  4.0      16              on         False
balancer_test_pool   2004k                3.0        499.9G  0.0000                                  1.0      32              on         False
cephfs.cephfs.data      0                 3.0        499.9G  0.0000                                  1.0     512              on         True

# ceph df detail
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
hdd    500 GiB  489 GiB  11 GiB    11 GiB       2.24
TOTAL  500 GiB  489 GiB  11 GiB    11 GiB       2.24

--- POOLS ---
POOL                ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED   (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr                 1    1  598 KiB  598 KiB     0 B        2  1.8 MiB  1.8 MiB     0 B      0    154 GiB            N/A          N/A    N/A         0 B          0 B
cephfs.cephfs.meta   2   16  2.4 KiB  2.4 KiB     0 B       22   96 KiB   96 KiB     0 B      0    154 GiB            N/A          N/A    N/A         0 B          0 B
cephfs.cephfs.data   3  512      0 B      0 B     0 B        0      0 B      0 B     0 B      0    154 GiB            N/A          N/A    N/A         0 B          0 B
balancer_test_pool   8  243  2.0 MiB  2.0 MiB     0 B      501  5.9 MiB  5.9 MiB     0 B      0    154 GiB            N/A          N/A    N/A         0 B          0 B

Version-Release number of selected component (if applicable):
# ceph version
ceph version 19.1.0-4.el9cp (b2c7ded5f7885ce1d488a241a30cba80f58d28bc) squid (rc)

How reproducible:
5/5 via automated runs

Steps to Reproduce:
1. Update the balancer mode on the cluster to upmap-read.
cmd : ceph balancer mode upmap-read

2. Create a test pool (balancer_test_pool), enable the application on the pool and write some data.

3. Once a few objects are created on the pool, enable the bulk flag.
cmd : ceph osd pool set balancer_test_pool bulk true

4. Once the bulk flag is set on the pool, a new PG count is calculated for the pool, and the PGs are split to reach the desired count. In this case, the new PG count was calculated to be 256, and the pool scaled up to 256 PGs.

5. Once the scale up is complete, remove the bulk flag on the pool.
cmd : ceph osd pool set balancer_test_pool bulk false

6. After removing the bulk flag, a new PG count is again calculated for the pool, in this case 32 PGs. However, the PG scale-down is stuck, and the PG count on the pool does not decrease even after 30 minutes. Note that there is no I/O running on the cluster. (A consolidated command sketch of these steps follows below.)
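For reference, the steps above correspond roughly to the following command sequence. This is only a sketch, not the exact test automation; the pool creation command, the rados application name, and the rados bench parameters are illustrative assumptions.

# ceph balancer mode upmap-read
# ceph osd pool create balancer_test_pool
# ceph osd pool application enable balancer_test_pool rados
# rados bench -p balancer_test_pool 60 write --no-cleanup
# ceph osd pool set balancer_test_pool bulk true
# ceph osd pool autoscale-status                        # wait until the scale-up completes (256 PGs in this run)
# ceph osd pool set balancer_test_pool bulk false
# ceph osd pool ls detail | grep balancer_test_pool     # pg_num_target drops to 32, but pg_num stays stuck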

From the balancer logs:

    "balancer_test_pool": {
        "application_metadata": {
            "rados": {}
        },
        "auid": 0,
        "cache_min_evict_age": 0,
        "cache_min_flush_age": 0,
        "cache_mode": "none",
        "cache_target_dirty_high_ratio_micro": 600000,
        "cache_target_dirty_ratio_micro": 400000,
        "cache_target_full_ratio_micro": 800000,
        "create_time": "2024-08-01T09:23:57.179867+0000",
        "crush_rule": 0,
        "erasure_code_profile": "",
        "expected_num_objects": 0,
        "fast_read": false,
        "flags": 1,
        "flags_names": "hashpspool",
        "grade_table": [],
        "hit_set_count": 0,
        "hit_set_grade_decay_rate": 0,
        "hit_set_params": {
            "type": "none"
        },
        "hit_set_period": 0,
        "hit_set_search_last_n": 0,
        "last_change": "395",
        "last_force_op_resend": "0",
        "last_force_op_resend_preluminous": "391",
        "last_force_op_resend_prenautilus": "393",
        "last_pg_merge_meta": {
            "last_epoch_clean": 392,
            "last_epoch_started": 392,
            "ready_epoch": 393,
            "source_pgid": "8.f3",
            "source_version": "162'27",
            "target_version": "162'27"
        },
        "min_read_recency_for_promote": 0,
        "min_size": 2,
        "min_write_recency_for_promote": 0,
        "object_hash": 2,
        "options": {},
        "peering_crush_bucket_barrier": 0,
        "peering_crush_bucket_count": 0,
        "peering_crush_bucket_mandatory_member": 2147483647,
        "peering_crush_bucket_target": 0,
        "pg_autoscale_mode": "on",
        "pg_num": 243,
        "pg_num_pending": 243,
        "pg_num_target": 32,
        "pg_placement_num": 231,
        "pg_placement_num_target": 32,
        "pool": 8,
        "pool_name": "balancer_test_pool",
        "pool_snaps": [],
        "quota_max_bytes": 0,
        "quota_max_objects": 0,
        "read_balance": {
            "average_primary_affinity": 1.0,
            "average_primary_affinity_weighted": 1.0,
            "optimal_score": 1.0,
            "primary_affinity_weighted": 1.0000001192092896,
            "raw_score_acting": 1.3200000524520874,
            "raw_score_stable": 1.3200000524520874,
            "score_acting": 1.3200000524520874,
            "score_stable": 1.3200000524520874,
            "score_type": "Fair distribution"
        },
        "read_tier": -1,
        "removed_snaps": "[]",
        "size": 3,
        "snap_epoch": 0,
        "snap_mode": "selfmanaged",
        "snap_seq": 0,
        "stripe_width": 0,
        "target_max_bytes": 0,
        "target_max_objects": 0,
        "tier_of": -1,
        "tiers": [],
        "type": 1,
        "use_gmt_hitset": true,
        "write_tier": -1
    },
}
2024-08-01 10:00:24,781 [Dummy-2] [DEBUG] [root] root_ids [-1, -1, -1, -1] pools [1, 2, 3, 8] with 20 osds, pg_target 2000
2024-08-01 10:00:24,781 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,781 [Dummy-2] [INFO] [root] Pool '.mgr' root_id -1 using 3.449025238318487e-06 of space, bias 1.0, pg target 0.0022993501588789915 quantized to 1 (current 1)
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] Pool 'cephfs.cephfs.meta' root_id -1 using 1.8313408345053914e-07 of space, bias 4.0, pg target 0.0004883575558681044 quantized to 16 (current 16)
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] Pool 'balancer_test_pool' root_id -1 using 1.1468771976090014e-05 of space, bias 1.0, pg target 0.0076458479840600104 quantized to 32 (current 32)
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] effective_target_ratio 0.0 0.0 0 536787025920
2024-08-01 10:00:24,782 [Dummy-2] [INFO] [root] Pool 'cephfs.cephfs.data' root_id -1 using 0.0 of space, bias 1.0, pg target 666.6666666666666 quantized to 512 (current 512)
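For reference, the "pg target" values in these log lines are consistent with capacity_ratio * pg_target * bias / pool_size, where pg_target 2000 comes from the first line (20 OSDs times the default mon_target_pg_per_osd of 100). This formula is inferred from the numbers above, not quoted from the pg_autoscaler code:

    balancer_test_pool: 1.1468771976090014e-05 * 2000 * 1.0 / 3 = 0.0076458..., quantized to 32
    cephfs.cephfs.meta: 1.8313408345053914e-07 * 2000 * 4.0 / 3 = 0.00048836..., quantized to its pg_num_min of 16

In other words, the autoscaler is computing the new target of 32 PGs for balancer_test_pool correctly, which suggests the stall is in the PG merge itself rather than in the target calculation.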

Actual results:
The PGs are not being scaled down on the pool once the new PG count is identified by the PG autoscaler.

Expected results:
The PGs should be scaled down on the pool once the new PG count is identified by the PG autoscaler.

Additional info:
Attaching ceph, mon & PG autoscaler logs
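Workaround (per the Doc Text field above) - a rough sketch, assuming the affected pool is pool 8 (balancer_test_pool); the PG ID shown is illustrative:

# ceph osd dump | grep pg_upmap_primary        # list the pg_upmap_primary entries in the OSD map
# ceph osd rm-pg-upmap-primary 8.a             # repeat for each listed PG belonging to the affected pool

After the entries for the affected pool are removed, the pending PG merges proceed and the pool scales down to the new pg_num_target.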