1944009 – [GSS] "RGWReshardLock::lock failed to acquire lock on reshard.0000000002 ret=-16" messages are reported in rgw log

Bug 1944009 - [GSS] "RGWReshardLock::lock failed to acquire lock on reshard.0000000002 ret=-16" messages are reported in rgw log

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RGW
Sub Component:
Version:	4.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	5.1
Assignee:	J. Eric Ivancich
QA Contact:	Vidushi Mishra
Docs Contact:	Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks:	2031073

Reported:	2021-03-29 03:23 UTC by hhuan
Modified:	2024-06-14 01:03 UTC
CC List:	10 users
Fixed In Version:	ceph-16.2.6-1.el8cp
Doc Type:	Enhancement
Doc Text:	.Lock contention messages from the Ceph Object Gateway reshard queue are marked as informational
Previously, when the Ceph Object Gateway failed to get a lock on a reshard queue, the output log entry would appear to be an error, causing concern to customers. With this release, the entries in the output log appear as informational and are tagged as “INFO:”.
Clone Of:
Environment:
Last Closed:	2022-04-04 10:19:55 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ceph ceph pull 40862	0	None	closed	rgw: during reshard lock contention, adjust logging	2021-04-30 22:07:14 UTC
Red Hat Product Errata	RHSA-2022:1174	0	None	None	None	2022-04-04 10:20:13 UTC

Comment 3 J. Eric Ivancich 2021-04-14 17:07:13 UTC

To address the customer's question more directly: locks are a way for processes running in parallel to coordinate their access to shared objects or data. We would not want several RGW processes to process the same reshard log simultaneously, so the first one to try acquires the lock, the second one is locked out for the duration, and the first one releases the lock when it is done.
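As a rough illustration of the behavior described above, here is a minimal Python sketch of a non-blocking exclusive lock. It is not the RGW implementation: the class `ReshardLockSketch`, its methods, and the owner names `rgw.a` and `rgw.b` are hypothetical. The only detail taken from the log message is the return value: on Linux, errno 16 is EBUSY, so `ret=-16` corresponds to "the lock is already held".

```python
import errno
import threading


class ReshardLockSketch:
    """Hypothetical stand-in for an exclusive lock on one reshard log shard."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None

    def try_lock(self, owner):
        # Non-blocking attempt: 0 on success, -EBUSY if another owner holds it.
        if self._lock.acquire(blocking=False):
            self.owner = owner
            return 0
        return -errno.EBUSY

    def unlock(self, owner):
        # Only the current holder may release the lock.
        if self.owner != owner:
            return -errno.EPERM
        self.owner = None
        self._lock.release()
        return 0


lock = ReshardLockSketch("reshard.0000000002")
first = lock.try_lock("rgw.a")    # first daemon acquires the lock
second = lock.try_lock("rgw.b")   # second daemon is locked out for the duration
released = lock.unlock("rgw.a")   # first daemon finishes and releases
retry = lock.try_lock("rgw.b")    # a later attempt by the second daemon succeeds
```

In this sketch the second attempt is not a failure of the system; it is the lock doing its job, and the locked-out daemon simply skips that shard or tries again later.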

The customer clearly diagnosed this when they wrote: "Enable rgw debug log on the first rgw node in test env, find that the error msg is logged when another RGW daemon already acquired lock for reshard.000000000x:"

So the links to an analogous situation with LC (lifecycle) logs are relevant: although they concern a different subsystem of RGW, the underlying issue is ultimately the same.

I think the best course is to mark these messages INFOs rather than WARNINGs or ERRORs, so they don't raise unnecessary concern. If that's the case, remaining at log level 0 would not be an issue.
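The idea of treating lock contention as informational can be sketched as follows. This is a Python illustration using the standard `logging` module, not the actual Ceph change (Ceph uses its own numeric debug levels); the function name `log_lock_result` and the exact message wording are assumptions made for the example.

```python
import errno
import logging


def log_lock_result(logger, lock_name, ret):
    """Log a lock attempt result and return the level used (None on success)."""
    if ret == 0:
        return None
    if ret == -errno.EBUSY:
        # Expected contention: another process holds the lock, so this is
        # reported as informational rather than as an error.
        logger.info("INFO: lock on %s is held by another process, ret=%d",
                    lock_name, ret)
        return logging.INFO
    # Any other negative return code is still reported as an error.
    logger.error("failed to acquire lock on %s ret=%d", lock_name, ret)
    return logging.ERROR


logging.basicConfig(level=logging.INFO)
sketch_logger = logging.getLogger("reshard-sketch")
busy_level = log_lock_result(sketch_logger, "reshard.0000000002", -errno.EBUSY)
other_level = log_lock_result(sketch_logger, "reshard.0000000002", -errno.EIO)
```

The design choice shown here is to branch on the specific return code, so that only the expected "already locked" case is downgraded while genuine failures keep their severity.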

I'll put together a fix and target it for 5.1.

Eric

Comment 5 J. Eric Ivancich 2021-04-14 20:20:55 UTC

The upstream PR to address this can be found at https://github.com/ceph/ceph/pull/40862 .

Comment 6 J. Eric Ivancich 2021-04-14 20:46:27 UTC

The commit used from the PR linked in comment #5 is 6d3dee37791ad427a3435c493a1d7874ba075674 .

Comment 21 errata-xmlrpc 2022-04-04 10:19:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174
