Description of problem (please be as detailed as possible and provide log snippets):

1. I was able to delete the PVCs when the pool %USED was 100% and the cluster used capacity was 85%.
2. The PVs moved to Released, although the reclaim policy of the ceph-rbd storage class is Delete.
3. After deleting the PVs manually, the data was still not deleted.
4. Following this KCS, https://access.redhat.com/solutions/3001761, I changed set-full-ratio from 85% to 97% (a command sketch appears just before the steps to reproduce below).
5. The relevant data was then deleted; however, the data is not rebalanced across the OSDs (OSD 2 is at 83% used, while OSD 1 and OSD 0 are at 30%):

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  429 GiB  428 GiB  113 KiB  918 MiB   83 GiB  83.70  1.75  145      up
 1    hdd  0.50000   1.00000  512 GiB  154 GiB  152 GiB  141 KiB  1.6 GiB  358 GiB  30.02  0.63  177      up
 0    hdd  0.50000   1.00000  512 GiB  154 GiB  152 GiB  141 KiB  1.6 GiB  358 GiB  30.02  0.63  177      up
                       TOTAL  1.5 TiB  736 GiB  732 GiB  397 KiB  4.1 GiB  800 GiB  47.91

Version of all relevant components (if applicable):
ODF version: 4.11.0-69
OCP version: 4.11.0-0.nightly-2022-05-11-054135
OSD size: 512 GiB
Number of disks: 3
Provider: VMware

sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-112.el8cp (e18db2ff03ac60c64a18f3315c032b9d5a0a3b8f) pacific (stable)": 10
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
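Item 4 of the description raises the OSD full ratio per the KCS. A minimal sketch of the commands involved, assuming the default rook-ceph toolbox deployment in the openshift-storage namespace (the toolbox name and the use of oc rsh are assumptions, not taken from this report):

# Open a shell in the toolbox (deployment name is the usual ODF default, assumed here)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: show the current nearfull/backfillfull/full thresholds
ceph osd dump | grep -E 'nearfull_ratio|backfillfull_ratio|full_ratio'

# Temporarily raise the full ratio so deletes can be processed (the value used in this report)
ceph osd set-full-ratio 0.97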
Steps to Reproduce:

1. Fill the disks to 85% with the Benchmark Operator (10 FIO pods + 10 PVCs):
   https://github.com/Oded1990/odf-scripts/blob/main/interactive_scripts/run_benchmark_fio.py

2. Check ceph df:

sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    1.5 TiB  230 GiB  1.3 TiB   1.3 TiB      85.00
TOTAL  1.5 TiB  230 GiB  1.3 TiB   1.3 TiB      85.00

--- POOLS ---
POOL                                                    ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
ocs-storagecluster-cephblockpool                         1   32  433 GiB  111.64k  1.3 TiB  100.00        0 B
device_health_metrics                                    2    1  4.2 KiB        3   13 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.log               3    8   46 KiB      340  2.0 MiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.control           4    8      0 B        8      0 B       0        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec    5    8      0 B        0      0 B       0        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index     6    8  8.3 KiB       22   25 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.meta              7    8   15 KiB       16  201 KiB  100.00        0 B
.rgw.root                                                8    8  4.9 KiB       16  180 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data      9   32    1 KiB        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata              10   32  3.1 MiB       25  9.5 MiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                 11   32  708 MiB      202  2.1 GiB  100.00        0 B

3. Delete all FIO pods and PVCs (works as expected):
   http://pastebin.test.redhat.com/1052896

4. Wait ~1 hour.

5. Check ceph df, ceph status, and ceph osd df:

sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    1.5 TiB  230 GiB  1.3 TiB   1.3 TiB      85.00
TOTAL  1.5 TiB  230 GiB  1.3 TiB   1.3 TiB      85.00

--- POOLS ---
POOL                                                    ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
ocs-storagecluster-cephblockpool                         1   32  433 GiB  111.64k  1.3 TiB  100.00        0 B
device_health_metrics                                    2    1  4.2 KiB        3   13 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.log               3    8   46 KiB      340  2.0 MiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.control           4    8      0 B        8      0 B       0        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec    5    8      0 B        0      0 B       0        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index     6    8  8.3 KiB       22   25 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.meta              7    8   15 KiB       16  201 KiB  100.00        0 B
.rgw.root                                                8    8  4.9 KiB       16  180 KiB  100.00        0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data      9   32    1 KiB        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata              10   32  3.1 MiB       25  9.5 MiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                 11   32  708 MiB      202  2.1 GiB  100.00        0 B

***********************************************************

sh-4.4$ ceph status
  cluster:
    id:     a28570c8-d885-4974-ba0a-f89964c007d6
    health: HEALTH_ERR
            1 backfillfull osd(s)
            2 full osd(s)
            11 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 2d), 3 in (since 2d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 112.27k objects, 434 GiB
    usage:   1.3 TiB used, 230 GiB / 1.5 TiB avail
    pgs:     177 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr

*****************************************************************

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  435 GiB  434 GiB   98 KiB  1.7 GiB   77 GiB  85.00  1.00  177      up
 1    hdd  0.50000   1.00000  512 GiB  435 GiB  434 GiB   98 KiB  1.6 GiB   77 GiB  85.00  1.00  177      up
 0    hdd  0.50000   1.00000  512 GiB  435 GiB  434 GiB   98 KiB  1.7 GiB   77 GiB  85.00  1.00  177      up
                       TOTAL  1.5 TiB  1.3 TiB  1.3 TiB  297 KiB  5.0 GiB  230 GiB  85.00
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

6. Restart the rook-ceph-operator pod:

$ oc delete pod rook-ceph-operator-64cfc9c7df-47w97

7. Check ceph status and ceph osd df again: same results as in step 5.

8. Check the ceph-rbd storage class; its RECLAIMPOLICY is "Delete":

$ oc get sc ocs-storagecluster-ceph-rbd
NAME                          PROVISIONER                          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com   Delete          Immediate           true                   3d

9. All of the old PVs were in the Released state:
   http://pastebin.test.redhat.com/1052965

$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                  STORAGECLASS                  REASON   AGE
pvc-183a828d-6dbf-4554-bc10-4a732fa58bcf   40Gi       RWO            Delete           Released   benchmark-operator/claim-10-16e9a170   ocs-storagecluster-ceph-rbd            2d20h

10. Deleted all of the PVs manually (the PVs were deleted); an illustrative cleanup sequence is sketched after the expected results below.

11. Wait ~1 hour: the used capacity stays stuck at 85% and the pools at 100%.
    Why are the PVs not deleted automatically when RECLAIMPOLICY=Delete?

12. Changed set-full-ratio from 0.85 to 0.97:

bash-4.4$ ceph osd set-full-ratio 0.97
osd set-full-ratio 0.97

13. Wait 1 hour.

14. The relevant data was deleted:

sh-4.4$ ceph status
  cluster:
    id:     a28570c8-d885-4974-ba0a-f89964c007d6
    health: HEALTH_WARN
            1 backfillfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
            Degraded data redundancy: 39294/119781 objects degraded (32.805%), 32 pgs degraded, 32 pgs undersized
            11 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum a,b,c (age 28h)
    mgr: a(active, since 8d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 25h), 3 in (since 8d); 32 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 39.93k objects, 153 GiB
    usage:   736 GiB used, 800 GiB / 1.5 TiB avail
    pgs:     39294/119781 objects degraded (32.805%)
             145 active+clean
             32  active+undersized+degraded+remapped+backfill_toofull

  io:
    client:   1.2 KiB/s rd, 15 KiB/s wr, 2 op/s rd, 1 op/s wr

  progress:
    Global Recovery Event (1h)
      [======================......] (remaining: 5h)

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 2    hdd  0.50000   1.00000  512 GiB  429 GiB  428 GiB  113 KiB  918 MiB   83 GiB  83.70  1.75  145      up
 1    hdd  0.50000   1.00000  512 GiB  154 GiB  152 GiB  141 KiB  1.6 GiB  358 GiB  30.02  0.63  177      up
 0    hdd  0.50000   1.00000  512 GiB  154 GiB  152 GiB  141 KiB  1.6 GiB  358 GiB  30.02  0.63  177      up
                       TOTAL  1.5 TiB  736 GiB  732 GiB  397 KiB  4.1 GiB  800 GiB  47.91
MIN/MAX VAR: 0.63/1.75  STDDEV: 25.31

15. The data is still not rebalanced after more than 48 hours: OSD 2 is at 83% used, while OSD 1 and OSD 0 are at 30% (see the sketch at the end of the report for commands to monitor the backfill).

Actual results:
The data is not rebalanced.

Expected results:
The data is balanced.
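For steps 9-10, an illustrative way to list and remove the PVs left in Released; this is a sketch of the manual cleanup described above, not a command taken from the report:

# List PVs whose status is Released
oc get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'

# Delete them, as in step 10 (only after confirming the data is no longer needed)
oc get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}' | xargs -r oc delete pv

As noted in the description, deleting the PV objects by hand did not free the backing data while the cluster was full.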
Additional info:
ODF must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2090338/
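For step 15, a minimal sketch, again assuming the rook-ceph toolbox, of commands for watching whether the remapped PGs make progress and for restoring the threshold once usage drops (0.85 is the original value reported in step 12):

# Watch recovery/backfill progress and per-OSD utilization
ceph status
ceph osd df

# Check whether the upmap balancer is enabled and active
ceph balancer status

# Once utilization is back below the default thresholds, restore the full ratio
ceph osd set-full-ratio 0.85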
Not a 4.11 blocker, created a ceph tracker for better attention.
(In reply to Mudit Agarwal from comment #2)
> Not a 4.11 blocker, created a ceph tracker for better attention.

This is not a bug; the correct config option was not used. I provided the feedback in the Ceph bug [1]. Please test again and it should help.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2096194#c2
(In reply to Vikhyat Umrao from comment #3)
> (In reply to Mudit Agarwal from comment #2)
> > Not a 4.11 blocker, created a ceph tracker for better attention.
>
> This is not a bug; the correct config option was not used. I provided the
> feedback in the Ceph bug [1]. Please test again and it should help.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=2096194#c2

https://bugzilla.redhat.com/show_bug.cgi?id=2096194#c3 - closing this one as NOTABUG, as it is working fine now with the correct config option.