Bug 2112122

Summary: [RFE] RGW should quickly respond HTTP 500 on non-serviceable requests
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Michael J. Kidd <linuxkidd>
Component: RGW Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED NOTABUG QA Contact: Madhavi Kasturi <mkasturi>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.2 CC: bniver, cbodley, ceph-eng-bugs, cephqe-warriors, gjose, jdurgin, kbader, kkeithle, lithomas, mbenjamin, mmuench, nojha, vumrao
Target Milestone: --- Keywords: FutureFeature
Target Release: 6.1   
Hardware: x86_64   
OS: Linux   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-03-24 18:05:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Michael J. Kidd 2022-07-28 21:35:55 UTC
Description of problem:
- When backing storage has a problem that makes a request unserviceable (an unfound bucket index object, for example), RGW holds the connection and its thread indefinitely, waiting for the requested object to become available from backing storage.
- For S3 / Swift HTTP requests, most clients time out after 30 to 60 seconds by default and then retry the request.
- Each retry ties up another RGW thread, which eventually exhausts the RGW thread pool and blocks all client requests.
- The only way to free the RGW from this state is a service restart.
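
The failure mode described above can be sketched as a small simulation. This is not RGW code: the pool size, the timeout, and the names `handle`, `client_request`, and `simulate` are all hypothetical, and a Python thread pool only stands in for the RGW thread pool. It shows how a handler that waits forever turns client retries into pool exhaustion.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import threading

POOL_SIZE = 4           # hypothetical worker count, stands in for the RGW thread pool
CLIENT_TIMEOUT = 0.2    # hypothetical client timeout in seconds (real clients: 30 to 60 s)

backend_recovered = threading.Event()   # stays unset while the index object is unfound


def handle(bucket):
    """Simulated request handler: blocks while the bucket's index is unavailable."""
    if bucket == "injured":
        backend_recovered.wait()        # holds the worker thread indefinitely
    return 200


def client_request(pool, bucket):
    """Submit a request and give up after CLIENT_TIMEOUT, as an HTTP client would."""
    future = pool.submit(handle, bucket)
    try:
        return future.result(timeout=CLIENT_TIMEOUT)
    except FutureTimeout:
        return None                     # the client gave up; the worker stays blocked


def simulate():
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    try:
        before = client_request(pool, "healthy")
        # Each timed-out attempt against the injured bucket pins one more worker.
        retries = [client_request(pool, "injured") for _ in range(POOL_SIZE)]
        # Every worker is now blocked, so a healthy bucket cannot be served either.
        after = client_request(pool, "healthy")
        return before, retries, after
    finally:
        backend_recovered.set()         # release the workers so the simulation can exit
        pool.shutdown(wait=True)
```

In this sketch the request to the healthy bucket succeeds before the retries and gets no response afterwards, which mirrors the report: requests unrelated to the injured object are blocked as well.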

Version-Release number of selected component (if applicable):
- 4.2z2

How reproducible:
- 100%

Steps to Reproduce:
1. Generate an unfound object situation for a bucket index object
2. Start a loop from one or more clients that incurs a bucket index op on the injured bucket
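
A minimal sketch of the client loop in step 2, using only the Python standard library. The endpoint, the bucket name, and the helpers `list_bucket` and `hammer` are hypothetical. The listing request is unsigned for brevity, so it assumes the bucket permits anonymous listing; otherwise a signed S3 client would be needed to reach the bucket index.

```python
import urllib.error
import urllib.request

# Hypothetical values: substitute the real RGW endpoint and the injured bucket.
ENDPOINT = "http://rgw.example.com:8080"
BUCKET = "injured-bucket"


def list_bucket(timeout):
    """Request a bucket listing (a bucket index read) and return the HTTP status."""
    try:
        with urllib.request.urlopen(f"{ENDPOINT}/{BUCKET}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code                 # the server answered with an error status


def hammer(send, attempts, timeout=30.0):
    """Call send(timeout) repeatedly, recording a status code or 'no response'."""
    outcomes = []
    for _ in range(attempts):
        try:
            outcomes.append(send(timeout))
        except OSError:
            # Timeouts and connection failures (URLError, socket.timeout)
            # all derive from OSError.
            outcomes.append("no response")
    return outcomes
```

Calling `hammer(list_bucket, attempts=100)` from one or more clients would then issue the repeated listings; the attempt count is arbitrary.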

Actual results:
- After thread exhaustion, RGW begins responding with 504 errors to all requests, not just those which require the injured object.

Expected results:
- Prompt HTTP 500 (or similar) error returned to the client when a request cannot be serviced due to a known lack of object availability.
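
One way to picture the requested behavior is a deadline on the backend wait. This is only a sketch under stated assumptions, not a proposal for how RGW should implement it: `serve`, `backend_pool`, and `BACKEND_DEADLINE` are hypothetical names, and the backend read is modeled as a plain callable.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BACKEND_DEADLINE = 0.1   # hypothetical per-request budget in seconds

# Backend reads run on their own pool, so the request path only waits
# for a bounded time instead of blocking on the read itself.
backend_pool = ThreadPoolExecutor(max_workers=2)


def serve(backend_op, deadline=BACKEND_DEADLINE):
    """Return (status, body); answer 500 if the backend misses the deadline."""
    future = backend_pool.submit(backend_op)
    try:
        return 200, future.result(timeout=deadline)
    except FutureTimeout:
        future.cancel()   # only has an effect if the operation has not started yet
        return 500, "backend object unavailable"
```

The sketch bounds how long a request waits, but a read that has already started keeps its backend worker until it completes, so a real fix would also need to cancel the stuck operation or detect the unfound state before waiting.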