Description of problem:
This has occurred twice in the last two weeks and urgently needs a fix. While benchmarking Ceph object storage, I ingested 500 million 64 KB S3 objects into the cluster (EC 4+2 data pool). For an unrelated reason I then had to delete all of the objects, and the quickest way was to delete the entire RGW data pool containing those 500 million objects. As soon as I did, CPU utilization reached 99% on all Ceph OSD nodes, and all NVMe (BlueStore) devices on all OSD nodes hit 100% utilization. Further investigation showed that:
- The NVMe devices were 100% utilized by very heavy read I/O (almost no write I/O).
- This caused very high I/O wait on the CPUs, saturating them as well.
- The OSD data devices (HDDs) were idle, doing nothing.
I left the system in this state for another 8 hours (overnight); the next morning it was still unusable, at 100% CPU and 100% NVMe utilization. As I was running short on time, I had to purge the PVs/VGs/LVs and redeploy the entire cluster.

Version-Release number of selected component (if applicable):
RHCS 4.1

How reproducible:
Always

Steps to Reproduce:
1. Fill Ceph cluster pools with a large number of objects (e.g., 500 million).
2. Delete the pool storing those 500M objects.
3. Check CPU and NVMe (BlueStore) utilization on the OSD nodes.

Actual results:
Deleting a large pool saturates CPU and NVMe devices, making the cluster unusable.

Expected results:
After a pool (i.e., all of its underlying PGs) is deleted, the Ceph cluster should reclaim the capacity within minutes (if not instantaneously) and should not impact CPU or NVMe device utilization. Ideally, when a user deletes a pool, Ceph would just logically skip those blocks and overwrite them when new data is written (just a thought).

Additional info:
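As an alternative to dropping the whole data pool, the objects could have been removed through the S3 API in DeleteObjects batches, which lets the deletion be paced. A minimal sketch of the batching logic; the bucket name and endpoint in the commented usage are placeholders, not from this report:

```python
import itertools

def batched(keys, size=1000):
    """Yield lists of at most `size` keys; S3 DeleteObjects accepts up to 1000 keys per call."""
    it = iter(keys)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# Usage against a real RGW endpoint would look roughly like this (untested sketch,
# endpoint/bucket/credentials are assumptions):
# import boto3
# s3 = boto3.client("s3", endpoint_url="http://rgw.example:8080")
# for batch in batched(all_keys):
#     s3.delete_objects(Bucket="bench",
#                       Delete={"Objects": [{"Key": k} for k in batch]})
```

Pacing the loop (e.g., sleeping between batches) keeps the deletion load visible and controllable, at the cost of taking far longer than a pool delete.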
I have previously observed similar behaviour from the system with a much lower object count (45M): https://bugzilla.redhat.com/show_bug.cgi?id=1837493#c7
Very similar behavior has been reported in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1770510
Deletion of radosgw objects is very expensive. This will be mitigated by moving RGW's bucket index out of omap, and improved longer term for small objects in general with SeaStore. In the short term the fastest thing to do is redeploy the cluster; there is no way to easily delete everything.
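If redeployment is not an option, background PG deletion can at least be throttled so client I/O is not completely starved. A hedged sketch, assuming a Nautilus-based release (such as RHCS 4) where the osd_delete_sleep options exist; the values shown are illustrative, not tuned for this cluster:

```shell
# Insert a pause (in seconds) between PG deletion transactions,
# trading deletion speed for cluster responsiveness.
ceph config set osd osd_delete_sleep_ssd 1    # OSDs with BlueStore DB/WAL on NVMe
ceph config set osd osd_delete_sleep_hdd 5    # HDD-backed OSDs

# Watch cluster state and per-device load while the PGs drain:
ceph -s
iostat -x 5
```

This does not make pool deletion cheap; it only spreads the read-heavy RocksDB work over a longer window so the nodes stay usable.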