1884023 – list pending GCs is very slow

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RGW
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.2
Assignee:	Pritha Srivastava
QA Contact:	Rachana Patel
Docs Contact:	Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks:	1890121

Reported:	2020-09-30 19:33 UTC by John Harrigan
Modified:	2021-06-09 16:16 UTC
CC List:	15 users
Fixed In Version:	ceph-14.2.11-57.el8cp, ceph-14.2.11-57.el7cp
Doc Type:	Bug Fix
Doc Text:	Listing of entries in the last GC object does not enter a loop. Previously, the listing of entries in the last GC object entered a loop because the marker was reset every time for the last GC object. With this release, the truncated flag is updated, so the marker is no longer reset and the listing works as expected.
Clone Of:
Environment:
Last Closed:	2021-01-12 14:57:21 UTC
Embargoed:
Dependent Products:
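
The Doc Text above describes a paginated listing that looped because the marker kept being reset for the last GC object. The following is a toy model of that pattern, not the actual RGW code: `list_page`, `list_all`, the integer marker, and the `stale_flag_on_last` switch are all hypothetical names introduced for illustration. It shows why a listing only terminates when the truncated flag is cleared and the marker is allowed to advance.

```python
def list_page(entries, marker, max_entries):
    """Return one page from a single GC object plus (next_marker, truncated)."""
    page = entries[marker:marker + max_entries]
    next_marker = marker + len(page)
    truncated = next_marker < len(entries)  # True while entries remain
    return page, next_marker, truncated


def list_all(shards, max_entries=2, stale_flag_on_last=False, max_calls=100):
    """Walk every GC object page by page.

    stale_flag_on_last=True mimics the reported symptom in this toy model:
    on the last object the truncated flag is never cleared and the marker
    is reset, so the same page is listed again and again.
    """
    listed, calls = [], 0
    for index, entries in enumerate(shards):
        is_last = index == len(shards) - 1
        marker, truncated = 0, True
        while truncated:
            calls += 1
            if calls > max_calls:
                # Guard so the demonstration cannot spin forever.
                raise RuntimeError("listing did not terminate")
            page, marker, truncated = list_page(entries, marker, max_entries)
            listed.extend(page)
            if stale_flag_on_last and is_last:
                marker, truncated = 0, True  # hypothetical buggy behaviour
    return listed


shards = [["a", "b", "c"], ["d"], ["e", "f"]]
print(list_all(shards))
```

With `stale_flag_on_last=True` the same call raises `RuntimeError` once `max_calls` is exceeded, which is the toy equivalent of a listing that never returns.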

Attachments
text log of pending GC response times (95.69 KB, text/plain) 2020-12-14 20:21 UTC, John Harrigan	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2021:0081	0	None	None	None	2021-01-12 14:57:43 UTC

Description John Harrigan 2020-09-30 19:33:42 UTC

Description of problem:
`radosgw-admin gc list --include-all` can take over an hour to return on RHCS 4.1.
On RHCS 4.0 the same command returns in 90sec with the same workload executing.

Version-Release number of selected component (if applicable):
RHCS 4.1

How reproducible:
yes

Steps to Reproduce:
1. Two Identical Clusters: each 8x OSD/RGW nodes (192 OSDs)
Site 1 = RHCS 4.0
Site 2 = RHCS 4.1 (14.2.8-91.el7cp)
Clusters pre-filled to 25% RAW USED
62MB mean objsz: h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB
Manually set rgw_gc_obj_min_wait to 30min (2 hour lag by default)
Workload delWrite (50% delete / 50% write), 48hour runtime

2. Same polling script on both clusters, executes every three minutes:
   `radosgw-admin gc list --include-all` 

3. On RHCS 4.1 the command periodically requires 50min (or more) to complete.
   On RHCS 4.0 the command requires 90sec to return.
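
As a cross-check of the "62MB mean objsz" figure in step 1, the sketch below computes the weighted mean of the object size histogram. It assumes each bucket in the `h(...)KB` specification reads as `min|max|percent`; the report does not spell out the syntax, so that reading and the helper names `parse_histogram` and `mean_size_kb` are assumptions.

```python
SPEC = "h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB"


def parse_histogram(spec):
    """Parse 'h(min|max|pct,...)KB' into a list of (min_kb, max_kb, pct)."""
    inner = spec[spec.index("(") + 1:spec.rindex(")")]
    buckets = []
    for part in inner.split(","):
        low, high, pct = (int(value) for value in part.split("|"))
        buckets.append((low, high, pct))
    return buckets


def mean_size_kb(buckets):
    """Weighted mean size in KB, using each bucket's midpoint."""
    total_pct = sum(pct for _, _, pct in buckets)
    weighted = sum((low + high) / 2 * pct for low, high, pct in buckets)
    return weighted / total_pct


mean_kb = mean_size_kb(parse_histogram(SPEC))
print(f"{mean_kb / 1024:.1f} MB")  # prints "62.0 MB"
```

Under that reading, the 5% of objects at 1 GiB dominate the mean, even though half of all objects are only 1 KB.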

Actual results:
During the 48hr workload, 
Site1 (v4.0) saw 1526 samples compared to 109 in site2 (v4.1)
Site2 (v4.1) sees occasional huge spikes of #pendingGCs and slow response
One hour into workload, 17min delay
Two hours in (timestamp 17:27:44), 50min delay
Again at 18:27:41 - happens every hour, all on site2 (v4.1)
v4.0 shows much more steady progression

Expected results:
Consistent time to complete the command, producing the same number of samples on both sites in the 48hr runtime.
Steady increase and decrease in pending GCs, rather than huge spikes which
coincide with very long command completion times.
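
The polling script itself is not included in this report. A minimal sketch of how the completion time of `radosgw-admin gc list --include-all` could be sampled is shown below; the helper names `time_command` and `poll`, and the choice to subtract the command's own runtime from the wait, are assumptions rather than a description of the script that was used.

```python
import subprocess
import time

# Command taken from the report; everything else here is illustrative.
GC_LIST_CMD = ["radosgw-admin", "gc", "list", "--include-all"]


def time_command(cmd):
    """Run cmd once and return (elapsed_seconds, return_code, stdout)."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    elapsed = time.monotonic() - start
    return elapsed, result.returncode, result.stdout


def poll(cmd, interval_s, samples, sleep=time.sleep):
    """Collect (elapsed_seconds, return_code) for a fixed number of runs."""
    timings = []
    for _ in range(samples):
        elapsed, return_code, _stdout = time_command(cmd)
        timings.append((elapsed, return_code))
        # Wait out the rest of the interval; never sleep a negative time.
        sleep(max(0.0, interval_s - elapsed))
    return timings
```

A call such as `poll(GC_LIST_CMD, interval_s=180, samples=10)` would approximate the three-minute cadence from step 2. When a single run takes longer than the interval, the next run starts immediately, so slow responses reduce the number of samples collected in a fixed runtime.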

Additional info:
Raw results gsheet  https://docs.google.com/spreadsheets/d/1spUzXxiQu3RCioo7FM9vyrOt-g58kzJ9By6s26gys84/edit#gid=126852855

Pending GCs comparison (see attachment: w7pendingGCs.txt)

Comment 1 RHEL Program Management 2020-09-30 19:33:48 UTC

Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 38 Tim Wilkinson 2020-12-14 15:56:46 UTC

*** Bug 1898647 has been marked as a duplicate of this bug. ***

Comment 41 John Harrigan 2020-12-14 20:21:53 UTC

Created attachment 1739114
text log of pending GC response times

Comment 46 errata-xmlrpc 2021-01-12 14:57:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081
