Description of problem:
This has occurred to me twice in the last two weeks and desperately needs a fix.
I am benchmarking Ceph object storage. Per the workflow, I ingested 500 million 64K S3 objects into the cluster (EC 4+2 data pool).
For an unrelated reason I had to delete all of the objects from the cluster, and the quickest way was to delete the entire RGW data pool containing the 500 million objects. As soon as I did that, CPU utilization reached 99% on all Ceph OSD nodes, and at the same time the NVMe (BlueStore) devices hit 100% utilization across all OSD nodes. Upon further investigation, I found that:
- The NVMe devices are 100% utilized by an extremely high rate of read IOs (almost no write IO)
- This causes very high IO wait on the CPUs, saturating them as well
- The OSD data devices (HDD) were idle, not doing anything
I left the system like this for another 8 hours (overnight); the next morning it was still unusable, at 100% CPU utilization and 100% NVMe utilization. As I was running short on time, I had to purge the PVs/VGs/LVs and redeploy the entire cluster.
Version-Release number of selected component (if applicable):
RHCS 4.1
How reproducible:
ALWAYS
Steps to Reproduce:
1. Fill a Ceph cluster pool with a large number of objects (e.g. 500 million)
2. Delete the pool storing those 500M objects
3. Check CPU utilization and NVMe (BlueStore) device utilization on the OSD nodes
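The steps above could be sketched as follows. This is a hypothetical reproduction outline, not the exact commands from the report: the pool name `default.rgw.buckets.data` is the usual RGW data pool default but is an assumption here, and `rados bench` stands in for whatever S3 ingest tool was actually used.

```shell
# 1. Generate a large number of small (64K) objects directly into the pool.
#    The original workload went through the S3/RGW path; 'rados bench' is a
#    simpler stand-in that writes raw RADOS objects of a given size.
rados bench -p default.rgw.buckets.data 600 write -b 65536 --no-cleanup

# 2. Delete the data pool. Pool deletion must be explicitly enabled on the
#    monitors, and the pool name must be given twice as a safety check.
ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
ceph osd pool delete default.rgw.buckets.data default.rgw.buckets.data \
    --yes-i-really-really-mean-it

# 3. On each OSD node, watch device and CPU saturation while the PGs are
#    removed in the background.
iostat -x 5   # per-device %util and read IOPS (NVMe should show ~100%)
top           # CPU utilization and %wa (iowait)
```

These commands require a live cluster with admin credentials, so they are an operational sketch rather than a standalone script.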
Actual results:
Deleting a large pool causes CPU/NVMe saturation, making the cluster unusable.
Expected results:
After deleting a pool (i.e. all of its underlying PGs), the Ceph cluster should reclaim the capacity within several minutes (if not instantaneously), without impacting CPU/NVMe device utilization.
Ideally, when a user deletes a pool, Ceph would just logically discard those blocks and overwrite them when new data is written. (Just a thought.)
Additional info:
Comment 1 by RHEL Program Management, 2020-05-25 15:54:49 UTC
Deletion of radosgw objects is very expensive. This will be mitigated by moving RGW's bucket index out of omap, and improved longer term for small objects in general with seastore. In the short term the fastest thing to do is redeploy the cluster; there is no way to easily delete everything.