Bug 2214864 - Pools are marked full even if their PGs have no FULL OSDs
Summary: Pools are marked full even if their PGs have no FULL OSDs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.1z1
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-13 23:05 UTC by Harsh Kumar
Modified: 2023-07-31 06:41 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-11 19:51:27 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-6826 (Last Updated: 2023-06-13 23:07:52 UTC)

Description Harsh Kumar 2023-06-13 23:05:45 UTC
Description of problem:
The scenario is very similar to Red Hat BZ#1560802.
If one or more OSDs are FULL, I/Os are expected to fail for every pool that has any of these OSDs in its acting set. However, a pool that does not contain any of the FULL OSDs in its acting set should still be able to serve I/Os and should NOT be reported as 'full (no space)' in the health warning.
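
For reference, a quick way to cross-check this from the CLI (the PG ID 22.0 and the commands below are only examples based on the reproduction transcript further down) is to compare the OSDs reported full by health detail against the acting set of a PG in the affected pool:

 ceph health detail | grep 'is full'    # which OSDs are actually over the full ratio
 ceph pg map 22.0 -f json-pretty        # up/acting set of a PG in the pool in question
 ceph osd df tree                       # per-OSD utilisation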

Version-Release number of selected component (if applicable):
ceph version 17.2.6-70.el9cp quincy (stable)
ceph version 16.2.10-175.el8cp (8c714c8184d123e241e34f9c0f6abcc1d1858e1c) pacific (stable)

How reproducible:
5/5

Steps to Reproduce:
1. Create a replicated pool with a single PG, for convenience
2. Fetch the acting set of the created pool (say these OSDs are x, y, z)
3. Re-weight the OSDs (x, y, z) in that acting set to 0 so they are not selected when subsequent pools are created
4. Create another single-PG replicated pool and a single-PG EC pool (a single PG is not necessary, it just keeps the example simple)
5. Re-weight the zero-weight OSDs (x, y, z) back to 1
6. Fetch the acting sets of the new replicated and EC pools and confirm they do not contain any of the OSDs from the first pool's acting set (see the sketch after these steps)
7. Either lower the nearfull, backfillfull and full ratios (roughly 0.6, 0.6 and 0.7; the transcript below uses 0.65, 0.69 and 0.70) or keep the default values of 0.85, 0.90 and 0.95
8. Write data to the first replicated pool with any tool until the full ratio is reached
9. Once OSDs x, y, z are full, the cluster health output warns that these OSDs are full and also reports every pool as full, including the pools that do not have these OSDs in their acting set
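
A minimal sketch of the acting-set check from step 6 (assuming jq is available; PG IDs 21.0, 22.0 and 23.0 match the transcript in Additional info and will differ on another cluster):

 for pg in 21.0 22.0 23.0; do
     echo -n "$pg acting: "
     ceph pg map $pg -f json | jq -c '.acting'
 done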


Actual results:
All pools in the cluster are marked full when one or more OSDs become full, including pools whose acting sets do not contain any of the full OSDs

Expected results:
Pools whose acting sets contain full OSDs should stop serving/accepting I/O and be marked full, but pools that do not contain any full OSDs should not be marked full and should continue to serve I/O
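
A rough way to verify this expectation per pool (a sketch, assuming jq; OSDs 2, 4 and 9 are the full OSDs from this reproduction, and the JSON field names may differ slightly between releases) is to count how many acting-set slots of each pool land on a full OSD; only pools with a non-zero count would be expected to be flagged full:

 for pool in $(ceph osd pool ls); do
     n=$(ceph pg ls-by-pool "$pool" -f json | jq '[.pg_stats[].acting[]] | map(select(. == 2 or . == 4 or . == 9)) | length')
     echo "$pool: $n acting-set slots on full OSDs"
 done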

Additional info:
 cephadm -v shell -- ceph osd pool create pool_full_osds 1 1
 cephadm -v shell -- sudo ceph osd pool application enable pool_full_osds rados
 cephadm -v shell -- ceph osd pool set pool_full_osds pg_autoscale_mode off 
 cephadm -v shell -- ceph pg map 21.0 -f json 
 cephadm -v shell -- ceph osd reweight osd.9 0 
 cephadm -v shell -- ceph osd reweight osd.2 0 
 cephadm -v shell -- ceph osd reweight osd.4 0 
 cephadm -v shell -- ceph osd pool create re_pool_test 1 1 
 cephadm -v shell -- sudo ceph osd pool application enable re_pool_test rados 
 cephadm -v shell -- ceph osd pool set re_pool_test pg_autoscale_mode off 
 cephadm -v shell -- ceph osd dump -f json 
 cephadm -v shell -- ceph pg map 22.0 -f json 
 cephadm -v shell -- ceph osd erasure-code-profile set ecprofile_test_ec_pool crush-failure-domain=osd k=4 m=2 plugin=jerasure 
 cephadm -v shell -- ceph osd pool create test_ec_pool 1 1 erasure ecprofile_test_ec_pool 
 cephadm -v shell -- sudo ceph osd pool application enable test_ec_pool rados 
 cephadm -v shell -- ceph osd pool set test_ec_pool pg_autoscale_mode off 
 cephadm -v shell -- ceph pg map 23.0 -f json 
 Acting set of pool_full_osds (PG 21.0): [9, 2, 4] | re_pool_test (PG 22.0): [3, 7, 5] | test_ec_pool (PG 23.0): [10, 8, 14, 6, 12, 1]
 cephadm -v shell -- ceph osd set-nearfull-ratio 0.65 
 cephadm -v shell -- ceph osd set-backfillfull-ratio 0.69 
 cephadm -v shell -- ceph osd set-full-ratio 0.70 
 cephadm -v shell -- ceph osd reweight osd.9 1 
 cephadm -v shell -- ceph osd reweight osd.2 1 
 cephadm -v shell -- ceph osd reweight osd.4 1 

 rados bench -p pool_full_osds 200 write -b 16384KB --no-cleanup --max-objects 1120   (run on 10.0.208.231, timeout 600 s)

 cephadm -v shell -- ceph health detail 

 HEALTH_ERR 3 full osd(s); 10 pool(s) full
[ERR] OSD_FULL: 3 full osd(s)
    osd.2 is full
    osd.4 is full
    osd.9 is full
[WRN] POOL_FULL: 10 pool(s) full
    pool 'device_health_metrics' is full (no space)
    pool 'cephfs.cephfs.meta' is full (no space)
    pool 'cephfs.cephfs.data' is full (no space)
    pool '.rgw.root' is full (no space)
    pool 'default.rgw.log' is full (no space)
    pool 'default.rgw.control' is full (no space)
    pool 'default.rgw.meta' is full (no space)
    pool 'pool_full_osds' is full (no space)
    pool 're_pool_test' is full (no space)
    pool 'test_ec_pool' is full (no space)
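
For completeness, the per-pool full flag can also be inspected directly in the OSD map and in the usage report (a sketch; exact output varies by release):

 ceph osd dump | grep 'flags.*full'    # pools carrying the 'full' flag in the OSD map
 ceph df detail                        # per-pool usage, quota and availability view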

Comment 2 Harsh Kumar 2023-06-14 21:55:44 UTC
@Radoslaw

