Bug 2249651 - multisite: DeleteObjects requests may deadlock in RGWDataChangesLog
Summary: multisite: DeleteObjects requests may deadlock in RGWDataChangesLog
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW-Multisite
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Casey Bodley
QA Contact: Hemanth Sai
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2267614
 
Reported: 2023-11-14 15:49 UTC by Casey Bodley
Modified: 2024-06-13 14:23 UTC
CC List: 5 users

Fixed In Version: ceph-18.2.1-21.el9cp
Doc Type: Bug Fix
Doc Text:
.Ceph Object Gateway no longer deadlocks during object deletion
Previously, in a multi-site deployment, the Ceph Object Gateway S3 DeleteObjects operation processed several object deletions at a time, and the concurrent requests could deadlock the gateway so that it stopped accepting new requests. With this fix, writes to the replication logs are serialized and the deadlock is prevented.
Clone Of:
Environment:
Last Closed: 2024-06-13 14:23:17 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 63373 0 None None None 2023-11-14 15:49:15 UTC
Github ceph ceph pull 55522 0 None open rgw/datalog: RGWDataChangesLog::add_entry() uses null_yield 2024-02-09 18:57:23 UTC
Red Hat Issue Tracker RHCEPH-7914 0 None None None 2023-11-15 09:06:04 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:23:23 UTC

Description Casey Bodley 2023-11-14 15:49:16 UTC
Description of problem:

The S3 DeleteObjects operation was changed in https://github.com/ceph/ceph/pull/48679 to support concurrent object deletes. Several upstream users have since reported that this leads to deadlocks when multisite is enabled.

https://tracker.ceph.com/issues/63373 contains stack traces from several threads blocked trying to acquire a mutex in LazyFIFO::lazy_init()
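
To make the failure mode concrete, the sketch below is a minimal, hypothetical Python model of the pattern, not Ceph code (POOL_SIZE, init_lock, lazy_init and handle_delete are illustrative stand-ins). Every concurrent deletion funnels through a one-time initialization that blocks on I/O while holding a mutex, and that I/O can only be completed by a thread from the same pool, all of which are already parked on the mutex:

import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

POOL_SIZE = 2                                  # stand-in for the RGW frontend thread pool
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
init_lock = threading.Lock()                   # stand-in for the mutex in LazyFIFO::lazy_init()

def lazy_init():
    with init_lock:
        # While holding the lock, block until "I/O" completes. The completion
        # can only run on a pool thread, but every pool thread is either here
        # or parked on init_lock above -- so nobody can deliver it.
        completion = pool.submit(lambda: 'fifo ready')
        return completion.result(timeout=2)    # timeout only so the demo terminates

def handle_delete(key):
    try:
        lazy_init()
        print(f'deleted object {key}')
    except FutureTimeout:
        print(f'delete {key} stalled: all workers blocked (the deadlock)')

# DeleteObjects fans out more concurrent deletions than there are workers.
futures = [pool.submit(handle_delete, k) for k in range(POOL_SIZE + 2)]
for f in futures:
    f.result()
pool.shutdown()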


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. deploy a multisite configuration with at least 2 zones
2. create a bucket and upload many objects
3. delete the objects in bulk, for example with `s3cmd rm -r s3://some-large-bucket` (a scripted alternative is sketched below)
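
As a scripted alternative for step 3, the bulk deletion can be driven through boto3, whose delete_objects() call maps directly to the S3 DeleteObjects operation whose concurrent processing exposes the deadlock. This is a sketch only; the endpoint, credentials, and bucket name are placeholders:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',  # placeholder RGW endpoint
    aws_access_key_id='ACCESS_KEY',              # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
)

bucket = 'some-large-bucket'

# List in pages of up to 1000 keys and delete each page with a single
# DeleteObjects request; RGW processes the deletions within a request
# concurrently, which is what exposes the deadlock.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
    if keys:
        s3.delete_objects(Bucket=bucket, Delete={'Objects': keys})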

Comment 1 RHEL Program Management 2023-11-14 18:03:58 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 7 errata-xmlrpc 2024-06-13 14:23:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

