Bug 1839807

Summary: After deleting pools, Ceph OSDs cause high CPU and NVMe utilisation, making the cluster unusable
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: karan singh <karan>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED DEFERRED
QA Contact: Manohar Murthy <mmurthy>
Severity: medium
Priority: unspecified
Version: 4.1
CC: akupczyk, bhubbard, ceph-eng-bugs, dzafman, jdurgin, kchai, nojha, rzarzyns, sseshasa, vereddy, vumrao
Target Milestone: rc
Target Release: 5.*
Keywords: Performance
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-06-03 21:21:46 UTC

Description karan singh 2020-05-25 15:54:41 UTC
Description of problem:

This has occurred to me twice in the last two weeks and desperately needs a fix.

I am benchmarking Ceph object storage. As part of the workflow, I ingested 500 million 64K S3 objects into the cluster (EC 4+2 data pool).
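
For reference, the ingest side is plain S3 PUTs against RGW. A minimal sketch of that kind of load generator (the endpoint, bucket name, and credentials here are hypothetical, and the real run was parallelized across many workers):

import boto3

# Hypothetical RGW endpoint and credentials; substitute your own.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="bench-bucket")
payload = b"\0" * (64 * 1024)  # 64K object body

# The actual benchmark ran a loop like this out to 500 million objects.
for i in range(1000):
    s3.put_object(Bucket="bench-bucket", Key="obj-%012d" % i, Body=payload)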

For an unrelated reason, I had to delete all of the objects from the cluster, and the quickest way to do that was to delete the entire RGW data pool containing the 500 million objects. As soon as I did, CPU utilization on all Ceph OSD nodes climbed to 99%, and at the same time I observed 100% utilization on the NVMe (BlueStore) devices across all OSD nodes. Upon further investigation, I found that:
- the NVMe devices were 100% utilized by very heavy read I/O (almost no write I/O)
- this drove very high I/O wait, saturating the CPUs as well
- the OSD data devices (HDDs) were idle, doing nothing

I left the system in this state for another 8 hours (overnight); the next morning it was still unusable, at 100% CPU utilization and 100% NVMe utilization. As I was running short on time, I had to purge the LVM setup (PVs, VGs, LVs) and redeploy the entire cluster.



Version-Release number of selected component (if applicable):

RHCS 4.1


How reproducible:

ALWAYS

Steps to Reproduce:
1. Fill a Ceph cluster pool with a large number of objects (e.g., 500 million)
2. Delete the pool storing those 500M objects
3. Check CPU utilization and NVMe (BlueStore) utilization on the OSD nodes
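
A condensed sketch of these steps using the python-rados bindings (the pool name and object count are illustrative; the original run used an EC 4+2 RGW data pool holding 500M 64K objects):

import rados

# Reproducer sketch. Note create_pool() makes a replicated pool;
# the original report used an EC 4+2 RGW data pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

# Step 1: fill a pool with a large number of small objects.
cluster.create_pool("bench-pool")
ioctx = cluster.open_ioctx("bench-pool")
payload = b"\0" * (64 * 1024)
for i in range(1_000_000):  # scale up toward the reported 500M
    ioctx.write_full("obj-%012d" % i, payload)
ioctx.close()

# Step 2: delete the pool (the mons must allow this, i.e.
# mon_allow_pool_delete=true).
cluster.delete_pool("bench-pool")
cluster.shutdown()

# Step 3: while the OSDs remove the deleted PGs, watch CPU and
# NVMe (BlueStore) utilization on the OSD nodes, e.g. with top/iostat.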

Actual results:

Deleting a large pool causes CPU/NVMe saturation, making the cluster unusable.


Expected results:

After deleting a pool (i.e., all of its underlying PGs), the Ceph cluster should reclaim the capacity within several minutes (if not instantaneously), without driving up CPU or NVMe device utilization.

Ideally, when a user deletes a pool, Ceph would just logically mark the affected blocks as free and overwrite them later, at the time new data is written. (Just a thought.)
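
A toy illustration of that idea (purely conceptual, not how BlueStore actually works): deletion would only move blocks onto a free list, and their contents would simply be overwritten by later writes, so the delete itself would cost almost nothing.

# Conceptual toy only, not BlueStore. Deleting a "pool" just marks
# its blocks free; no data is read back or scrubbed, and the space
# is reused lazily by subsequent writes.
class LazyStore:
    def __init__(self, num_blocks):
        self.data = {}                      # block id -> payload
        self.free = set(range(num_blocks))  # blocks available for reuse

    def write(self, payload):
        block = self.free.pop()    # reuse any free block, no cleanup first
        self.data[block] = payload
        return block

    def delete_pool(self, blocks):
        # Pure bookkeeping: mark the blocks free, defer all real work.
        for b in blocks:
            self.data.pop(b, None)
            self.free.add(b)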


Additional info:

Comment 1 RHEL Program Management 2020-05-25 15:54:49 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 karan singh 2020-05-25 18:53:50 UTC
I have previously observed similar behaviour on a system with a lower object count (45M):

https://bugzilla.redhat.com/show_bug.cgi?id=1837493#c7

Comment 3 karan singh 2020-05-25 18:55:08 UTC
Very similar behaviour has been reported in https://bugzilla.redhat.com/show_bug.cgi?id=1770510

Comment 4 Josh Durgin 2020-06-03 21:21:46 UTC
Deletion of radosgw objects is very expensive. This will be mitigated by moving RGW's bucket index out of omap and, longer term, improved for small objects in general with SeaStore. In the short term the fastest thing to do is redeploy the cluster - there's no way to easily delete everything.
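
For scale, even emptying the bucket client-side over S3 is bounded by the same per-object cost: each DeleteObjects call removes at most 1000 keys, and every key removed still means a bucket-index (omap) update plus removal of the underlying RADOS objects. A rough sketch, assuming a bucket named "bench-bucket" and a hypothetical endpoint:

import boto3

s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

# Bulk deletes are capped at 1000 keys per call, and each deleted key
# still triggers a bucket-index (omap) update on the OSDs, which is
# why removing 500M objects (or the pool holding them) is so costly.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bench-bucket"):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        s3.delete_objects(Bucket="bench-bucket", Delete={"Objects": keys})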