Bug 1884023

Summary: list pending GCs is very slow
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: John Harrigan <jharriga>
Component: RGW
Assignee: Pritha Srivastava <prsrivas>
Status: CLOSED ERRATA
QA Contact: Rachana Patel <racpatel>
Severity: high
Docs Contact: Ranjini M N <rmandyam>
Priority: unspecified
Version: 4.1
CC: cbodley, ceph-eng-bugs, ceph-qe-bugs, kbader, mbenjamin, prsrivas, racpatel, rmandyam, sweil, tchandra, tserlin, twilkins, ukurundw, vimishra, vumrao
Target Milestone: ---
Keywords: Performance
Target Release: 4.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-14.2.11-57.el8cp, ceph-14.2.11-57.el7cp
Doc Type: Bug Fix
Doc Text:
.Listing of entries in the last GC object does not enter a loop
Previously, the listing of entries in the last GC object entered a loop because the marker was reset on every pass over the last GC object. With this release, the truncated flag is updated so that the marker is not reset, and the listing completes as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-12 14:57:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1890121    
Attachments:
text log of pending GC response times (flags: none)

Description John Harrigan 2020-09-30 19:33:42 UTC
Description of problem:
`radosgw-admin gc list --include-all` can take over an hour to return on RHCS 4.1.
On RHCS 4.0 the same command returns in about 90 seconds under the same workload.

Version-Release number of selected component (if applicable):
RHCS 4.1

How reproducible:
Always

Steps to Reproduce:
1. Two identical clusters, each with 8 OSD/RGW nodes (192 OSDs total):
   Site 1 = RHCS 4.0
   Site 2 = RHCS 4.1 (14.2.8-91.el7cp)
   Both clusters pre-filled to 25% RAW USED.
   Mean object size 62 MB: h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB
   rgw_gc_obj_min_wait manually set to 30 minutes (default is 2 hours).
   Workload: delWrite (50% delete / 50% write), 48-hour runtime.

2. Same polling script on both clusters, executes every three minutes:
   `radosgw-admin gc list --include-all` 

3. On RHCS 4.1 the command periodically takes 50 minutes or more to complete.
   On RHCS 4.0 the command consistently returns in about 90 seconds.
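The polling script from step 2 is not attached; a minimal Python sketch of such a poller follows. The `poll_gc_once` helper, the `"tag"`-counting heuristic for sizing the pending list, and the 180-second interval are assumptions for illustration, not the original script.

```python
import subprocess
import time

# Hypothetical stand-in for the polling script described in step 2.
# Pending entries are counted by the number of "tag" fields in the
# command's JSON output; that heuristic is an assumption.
GC_LIST_CMD = ("radosgw-admin", "gc", "list", "--include-all")

def poll_gc_once(cmd=GC_LIST_CMD):
    """Run the GC listing once; return (pending_entries, elapsed_seconds)."""
    start = time.monotonic()
    out = subprocess.run(cmd, capture_output=True, text=True,
                         check=True).stdout
    elapsed = time.monotonic() - start
    return out.count('"tag"'), elapsed

def poll_loop(interval_s=180, cmd=GC_LIST_CMD):
    """Poll every interval_s seconds, logging count and response time."""
    while True:
        pending, elapsed = poll_gc_once(cmd)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"{stamp} pending={pending} elapsed={elapsed:.1f}s")
        time.sleep(interval_s)
```

Logging the elapsed time alongside the pending count is what makes the 90-second vs. 50-minute gap between the two sites visible in the raw results.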

Actual results:
During the 48-hour workload:
Site 1 (v4.0) collected 1526 samples versus 109 on Site 2 (v4.1).
Site 2 (v4.1) shows occasional huge spikes in pending GC count and slow responses:
one hour into the workload, a 17-minute delay; two hours in (timestamp 17:27:44),
a 50-minute delay; again at 18:27:41. This recurs every hour, all on Site 2 (v4.1).
v4.0 shows a much steadier progression.

Expected results:
Consistent command completion time, producing the same number of samples over the 48-hour runtime.
A steady increase and decrease in pending GCs, rather than huge spikes that
coincide with very long command completion times.

Additional info:
Raw results gsheet  https://docs.google.com/spreadsheets/d/1spUzXxiQu3RCioo7FM9vyrOt-g58kzJ9By6s26gys84/edit#gid=126852855

Pending GCs comparison (see attachment: w7pendingGCs.txt)
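The eventual fix (see the Doc Text above) attributes the slowdown to the listing marker being reset on every pass over the last GC object. A minimal Python model of that pagination bug follows; the chunking scheme, chunk size, and names are illustrative, not Ceph's actual cls_rgw code.

```python
# Toy model of paginated GC listing: each call returns up to MAX_ENTRIES
# entries starting at `marker`, plus a truncated flag. Purely illustrative;
# not the real Ceph implementation.
MAX_ENTRIES = 2

def list_chunk(entries, marker):
    chunk = entries[marker:marker + MAX_ENTRIES]
    next_marker = marker + len(chunk)
    return chunk, next_marker, next_marker < len(entries)

def list_all_fixed(entries, max_iters=100):
    """Fixed behavior: the marker advances, so listing terminates."""
    out, marker, truncated, iters = [], 0, True, 0
    while truncated and iters < max_iters:
        chunk, marker, truncated = list_chunk(entries, marker)
        out.extend(chunk)
        iters += 1
    return out, iters

def list_all_buggy(entries, max_iters=100):
    """Buggy behavior: the marker is reset on each pass over the last GC
    object, so the same chunk is re-read until the caller gives up."""
    out, marker, truncated, iters = [], 0, True, 0
    while truncated and iters < max_iters:
        chunk, _next, truncated = list_chunk(entries, marker)
        out.extend(chunk)
        marker = 0  # the bug: marker reset instead of advanced
        iters += 1
    return out, iters
```

With five entries and a chunk size of two, the fixed loop returns all five entries in three passes, while the buggy loop re-reads the first chunk until it hits the iteration cap, mirroring the gc list calls that took 50 minutes instead of 90 seconds.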

Comment 1 RHEL Program Management 2020-09-30 19:33:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 38 Tim Wilkinson 2020-12-14 15:56:46 UTC
*** Bug 1898647 has been marked as a duplicate of this bug. ***

Comment 41 John Harrigan 2020-12-14 20:21:53 UTC
Created attachment 1739114 [details]
text log of pending GC response times

Comment 46 errata-xmlrpc 2021-01-12 14:57:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081