Bug 2356922 - Consistent, reproducible RGW crashes in special awscli upload-part-copy scenario
Summary: Consistent, reproducible RGW crashes in special awscli upload-part-copy scenario
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW-Multisite
Version: 7.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.1
Assignee: Matt Benjamin (redhat)
QA Contact: Chaithra
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2370192 2351689 2369418
 
Reported: 2025-04-02 14:22 UTC by Scott Nipp
Modified: 2025-08-13 09:40 UTC (History)
CC: 10 users

Fixed In Version: ceph-19.2.1-202.el9cp
Doc Type: Bug Fix
Doc Text:
.Invalid URL-encoded text from the client no longer creates errors
Previously, the system improperly handled scenarios where URL decoding of invalid URL-encoded text from the client resulted in an empty `key.name`. As a result, an assertion failed during the copy operation, sometimes leading to a crash later. With this fix, invalid empty `key.name` values are ignored, and copy operations no longer trigger assertions or cause crashes.
Clone Of:
: 2369418
Environment:
Last Closed: 2025-06-26 12:22:11 UTC
Embargoed:
mkasturi: needinfo+


Attachments


Links
Red Hat Issue Tracker RHCEPH-11036 (last updated 2025-04-02 14:23:55 UTC)
Red Hat Product Errata RHSA-2025:9775 (last updated 2025-06-26 12:22:34 UTC)

Description Scott Nipp 2025-04-02 14:22:21 UTC
Description of problem:
Customer is having frequent RGW crashes when they attempt to restore a deleted object version from a versioning-enabled bucket.

Version-Release number of selected component (if applicable):
RHCS 7.1.3
RHCS 6.1.6
Upstream Squid 19.2.1

How reproducible:
Customer is able to consistently reproduce this crash with the following steps.

Steps to Reproduce:
1. # Create bucket with versioning enabled
aws-debug s3api create-bucket --bucket lbarbe-debug
aws-debug s3api put-bucket-versioning  --bucket lbarbe-debug --versioning-configuration Status=Enabled
2. # Create file > 8M and upload it
dd if=/dev/zero of=file  bs=9M  count=1
aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"
3. # Delete the object
aws-debug s3 rm "s3://lbarbe-debug/object_with_%.txt"
4. # Retrieve the version before delete
aws-debug s3api list-object-versions --bucket lbarbe-debug --prefix 'object_with_%.txt'
5. # Restore the version (versionId= in URL)
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'object_with_%.txt' --copy-source 'lbarbe-debug/object_with_%.txt?versionId=R.F3FaB-tiE0-o-qNswoZwSGBstb79E' --part-number 1 --upload-id xxxx

Actual results:
This results in an RGW crash.

Expected results:
Object to be restored correctly

Additional info:
Customer began receiving these crashes on RHCS 6.1.6 and upgraded the cluster to 7.1.3. The specific crash log description differs between versions, but the crashes are equally consistent. They also tested this against an upstream 19.2.1 cluster, with the same crash details as on the RHCS 7.1.3 cluster.
Will be attaching a core file along with more log snippets of the crashes.
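The failing step hinges on the literal `%` in the key: when the copy-source is URL-decoded server-side, `%.t` is not a valid percent-escape, and per the eventual doc text the decode yields an empty `key.name`. Below is a minimal illustrative sketch of a strict percent-decoder with that failure mode; this is my illustration of the described behavior, not RGW's actual code.

```python
import string

def url_decode(s):
    """Illustrative strict percent-decoder. On an invalid %-escape it
    returns an empty string, mimicking the empty key.name failure mode
    described in the doc text (an assumption, not RGW's real decoder)."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "%":
            hexpart = s[i + 1:i + 3]
            if len(hexpart) < 2 or not all(c in string.hexdigits for c in hexpart):
                return ""  # invalid escape -> empty decoded name
            out.append(chr(int(hexpart, 16)))
            i += 3
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

# '%.t' is not a valid escape, so the whole key decodes to empty:
print(url_decode("object_with_%.txt"))    # -> ''
# A properly encoded '%25' round-trips back to the literal '%':
print(url_decode("object_with_%25.txt"))  # -> 'object_with_%.txt'
```

An empty decoded key would then flow into the copy path, matching the assertion seen in `RGWObjectCtx::set_atomic` in the trace below.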

Comment 4 Scott Nipp 2025-04-02 14:33:05 UTC
The attached log file is from a test with an older version of the AWS CLI, tried in relation to KCS https://access.redhat.com/solutions/7109373.

However, although the CU did downgrade the AWS CLI version and retry the test, they also mentioned that this is being observed with multiple clients.

Comment 5 Scott Nipp 2025-04-02 15:16:18 UTC
We just got some new information from the CU on this.  They are able to reproduce this without versioning on an object.  Below is their abbreviated reproducer and crash log for this case without versioning on the object.

~~~
# Create a bucket
aws-debug s3api create-bucket --bucket lbarbe-debug

# Create file > 8M and upload it
dd if=/dev/zero of=file  bs=9M  count=1
aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"

# Copy file with multipart upload
aws-debug s3api create-multipart-upload --bucket lbarbe-debug --key 'new_object.txt'
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'new_object.txt' --copy-source 'lbarbe-debug/object_with_%.txt' --part-number 1 --upload-id "2~aUmf8gH_YdOCwNxyCrk---i8-1eE0v-"

~~~

# --> RGW Crash
~~~
    "/lib64/libc.so.6(+0x3e730) [0x7f8d6e517730]",
    "/lib64/libc.so.6(+0x8ba6c) [0x7f8d6e564a6c]",
    "raise()",
    "abort()",
    "/lib64/libc.so.6(+0x2875b) [0x7f8d6e50175b]",
    "/lib64/libc.so.6(+0x373c6) [0x7f8d6e5103c6]",
    "(RGWObjectCtx::set_atomic(rgw_obj const&)+0xf8) [0x55e1c44658a8]",
    "/usr/bin/radosgw(+0x4ff208) [0x55e1c4246208]",
    "(RGWPutObj::verify_permission(optional_yield)+0x512) [0x55e1c4266de2]",
    "(rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x648) [0x55e1c41b6c58]",
    "(process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1002) [0x55e1c41bdeb2]",
    "/usr/bin/radosgw(+0xb9cdf1) [0x55e1c48e3df1]",
    "/usr/bin/radosgw(+0x385e76) [0x55e1c40cce76]",
    "make_fcontext()"
~~~
So the crash appears exactly the same whether or not the object is versioned.  I've asked the CU if they can test this in a bucket without versioning enabled to see if that makes a difference.  I'll update once they provide those results.

Comment 6 Scott Nipp 2025-04-02 15:24:46 UTC
Ignore previous comment #5.  I missed that the CU in that test had created a completely new bucket without versioning.  I thought they had simply uploaded another object to the same bucket without specifying versioning on the object.

Comment 8 Scott Nipp 2025-04-23 13:25:21 UTC
I followed up with the customer on this issue today just to touch base.  I expressed that this goes beyond just Ceph/RGW, as the issue seems to lie with the AWS S3 protocol itself.  Their response was the concern that a bad actor might use this as a security threat to crash their RGW daemons.  They are less concerned with reliably using the '%' character in object names than with ensuring this cannot crash the RGW daemons.
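A possible client-side mitigation until a fixed build lands (my suggestion, not from this BZ): percent-encode the key before building the `--copy-source` string, so the literal `%` arrives as a valid `%25` escape that the server can decode cleanly. A short sketch using the bucket and key from the reproducer:

```python
from urllib.parse import quote

# Bucket and key taken from the reproducer above
bucket = "lbarbe-debug"
key = "object_with_%.txt"

# safe="" encodes every reserved character, including '/', so the
# bucket prefix is joined separately
copy_source = f"{bucket}/{quote(key, safe='')}"
print(copy_source)  # -> 'lbarbe-debug/object_with_%25.txt'
```

This only sidesteps the invalid escape on the client; it does not address the server-side crash itself.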

Comment 9 Scott Nipp 2025-05-12 14:58:21 UTC
I just wanted to check in again on this for the customer regarding their concern that this could potentially be used as a DoS-type attack by crashing the RGW daemons.  Any thoughts on mitigating this particular AWS S3 protocol issue?

Comment 19 errata-xmlrpc 2025-06-26 12:22:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775

