
Bug 2356922

Summary: Consistent, reproducible RGW crashes in special awscli upload-part-copy scenario
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RGW-Multisite
Version: 7.1
Target Milestone: ---
Target Release: 8.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Status: CLOSED ERRATA
Reporter: Scott Nipp <snipp>
Assignee: Matt Benjamin (redhat) <mbenjamin>
QA Contact: Chaithra <ckulal>
Docs Contact: Rivka Pollack <rpollack>
CC: ceph-eng-bugs, cephqe-warriors, ckulal, laurent.barbe, mbenjamin, mkasturi, rpollack, rsachere, tru, tserlin
Flags: mkasturi: needinfo+
Whiteboard:
Fixed In Version: ceph-19.2.1-202.el9cp
Doc Type: Bug Fix
Doc Text:
.Invalid URL-encoded text from the client no longer causes errors

Previously, the system improperly handled scenarios where invalid URL-encoded text from the client caused URL decoding to produce an empty `key.name`. As a result, an assertion error occurred during the copy operation and sometimes led to a crash later. With this fix, invalid empty `key.name` values are ignored, and copy operations no longer trigger assertions or cause crashes.
Story Points: ---
Clone Of:
Cloned As: 2369418
Environment:
Last Closed: 2025-06-26 12:22:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2370192, 2351689, 2369418    

Description Scott Nipp 2025-04-02 14:22:21 UTC
Description of problem:
Customer is experiencing frequent RGW crashes when they attempt to restore a deleted object version from a versioning-enabled bucket.

Version-Release number of selected component (if applicable):
RHCS 7.1.3
RHCS 6.1.6
Upstream Squid 19.2.1

How reproducible:
Customer is able to consistently reproduce this crash with the following steps; a consolidated script sketch follows the list.

Steps to Reproduce:
1. # Create bucket with versioning enabled
aws-debug s3api create-bucket --bucket lbarbe-debug
aws-debug s3api put-bucket-versioning  --bucket lbarbe-debug --versioning-configuration Status=Enabled
2. # Create file > 8M and upload it
dd if=/dev/zero of=file  bs=9M  count=1
aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"
3. # Delete the object
aws-debug s3 rm "s3://lbarbe-debug/object_with_%.txt"
4. # Retrieve the version before delete
aws-debug s3api list-object-versions --bucket lbarbe-debug --prefix 'object_with_%.txt'
5. # Restore the version (versionId= in the URL)
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'object_with_%.txt' --copy-source 'lbarbe-debug/object_with_%.txt?versionId=R.F3FaB-tiE0-o-qNswoZwSGBstb79E' --part-number 1 --upload-id xxxx
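
For convenience, the five steps can be run as a single script. The following is only a sketch under this report's assumptions: `aws-debug` is the customer's debug-enabled AWS CLI wrapper with credentials and the RGW endpoint already configured, and the upload ID (shown only as the placeholder `xxxx` in step 5) comes from a preceding `create-multipart-upload` call, as in the reproducer in comment 5.

~~~
#!/usr/bin/env bash
# Consolidated reproducer sketch; "aws-debug" is assumed to wrap the aws CLI.
set -euo pipefail

BUCKET=lbarbe-debug
KEY='object_with_%.txt'

# 1. Create the bucket with versioning enabled
aws-debug s3api create-bucket --bucket "$BUCKET"
aws-debug s3api put-bucket-versioning --bucket "$BUCKET" \
    --versioning-configuration Status=Enabled

# 2. Create a file larger than 8M and upload it
dd if=/dev/zero of=file bs=9M count=1
aws-debug s3 cp file "s3://$BUCKET/$KEY"

# 3. Delete the object (this writes a delete marker)
aws-debug s3 rm "s3://$BUCKET/$KEY"

# 4. Retrieve the version that existed before the delete
VERSION_ID=$(aws-debug s3api list-object-versions --bucket "$BUCKET" \
    --prefix "$KEY" --query 'Versions[0].VersionId' --output text)

# 5. Start a multipart upload to obtain an upload ID, then attempt the
#    upload-part-copy that triggers the crash
UPLOAD_ID=$(aws-debug s3api create-multipart-upload --bucket "$BUCKET" \
    --key "$KEY" --query 'UploadId' --output text)
aws-debug s3api upload-part-copy --bucket "$BUCKET" --key "$KEY" \
    --copy-source "$BUCKET/$KEY?versionId=$VERSION_ID" \
    --part-number 1 --upload-id "$UPLOAD_ID"
~~~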

Actual results:
This results in an RGW crash.

Expected results:
Object to be restored correctly

Additional info:
Customer began receiving these crashes on RHCS 6.1.6 and upgraded the cluster to 7.1.3. The specific crash log description differs between versions, but the crashes occur just as consistently. They also tested against an upstream 19.2.1 cluster, with the same crash details as on the RHCS 7.1.3 cluster.
Will be attaching core file along with more log snippets of the crashes.

Comment 4 Scott Nipp 2025-04-02 14:33:05 UTC
The attached log file is from a test with an older version of the AWS CLI, tried in relation to KCS (https://access.redhat.com/solutions/7109373).

However, although the CU did downgrade the AWS CLI version and retry the test, they mentioned that this is being observed with multiple clients.

Comment 5 Scott Nipp 2025-04-02 15:16:18 UTC
We just got some new information from the CU on this. They are able to reproduce this without versioning on the object. Below is their abbreviated reproducer and the resulting crash log.

~~~
# Create a bucket
aws-debug s3api create-bucket --bucket lbarbe-debug

# Create file > 8M and upload it
dd if=/dev/zero of=file  bs=9M  count=1
aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"

# Copy file with multipart upload
aws-debug s3api create-multipart-upload --bucket lbarbe-debug --key 'new_object.txt'
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'new_object.txt' --copy-source 'lbarbe-debug/object_with_%.txt' --part-number 1 --upload-id "2~aUmf8gH_YdOCwNxyCrk---i8-1eE0v-"

~~~

~~~
# --> RGW Crash
    "/lib64/libc.so.6(+0x3e730) [0x7f8d6e517730]",
    "/lib64/libc.so.6(+0x8ba6c) [0x7f8d6e564a6c]",
    "raise()",
    "abort()",
    "/lib64/libc.so.6(+0x2875b) [0x7f8d6e50175b]",
    "/lib64/libc.so.6(+0x373c6) [0x7f8d6e5103c6]",
    "(RGWObjectCtx::set_atomic(rgw_obj const&)+0xf8) [0x55e1c44658a8]",
    "/usr/bin/radosgw(+0x4ff208) [0x55e1c4246208]",
    "(RGWPutObj::verify_permission(optional_yield)+0x512) [0x55e1c4266de2]",
    "(rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x648) [0x55e1c41b6c58]",
    "(process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1002) [0x55e1c41bdeb2]",
    "/usr/bin/radosgw(+0xb9cdf1) [0x55e1c48e3df1]",
    "/usr/bin/radosgw(+0x385e76) [0x55e1c40cce76]",
    "make_fcontext()"
~~~
So the crash appears exactly the same whether or not the object is versioned. I've asked the CU to test this in a bucket without versioning enabled to see if that makes a difference. I'll update once they provide those results.
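
One observation: the common factor in both reproducers is the literal `%` in the copy source. In `object_with_%.txt`, the sequence `%.t` is not a valid percent-escape, so server-side URL decoding of the `x-amz-copy-source` value can misfire before the assertion visible in `RGWObjectCtx::set_atomic` above. If that reading is right, pre-encoding the `%` should sidestep the crash. The following is a sketch, not from the customer's testing, and assumes the client passes `--copy-source` through verbatim rather than encoding it itself (as the older awscli builds discussed in comment 4 reportedly do):

~~~
# Assumption: encoding the literal '%' as '%25' gives RGW a copy source
# that URL-decodes cleanly back to the real key name.
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'new_object.txt' \
    --copy-source 'lbarbe-debug/object_with_%25.txt' \
    --part-number 1 --upload-id "2~aUmf8gH_YdOCwNxyCrk---i8-1eE0v-"
~~~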

Comment 6 Scott Nipp 2025-04-02 15:24:46 UTC
Ignore previous comment #5.  I missed that the CU in that test had created a completely new bucket without versioning.  I thought they had simply uploaded another object to the same bucket without specifying versioning on the object.

Comment 8 Scott Nipp 2025-04-23 13:25:21 UTC
I followed up with the customer on this issue today just to touch base. I expressed that this goes beyond just Ceph/RGW, as the issue seems to lie more with the AWS S3 protocol handling itself. Their response was a concern that a bad actor might exploit this to crash their RGW daemons. They are much less concerned with being able to reliably use the '%' character in object names than with ensuring this cannot crash the RGW daemons.

Comment 9 Scott Nipp 2025-05-12 14:58:21 UTC
I just wanted to check in again on this for the customer, regarding their concern that this could be used as a DoS-style attack by crashing the RGW daemons. Any thoughts on mitigating this particular AWS S3 protocol issue?
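
In the meantime, one way operators can at least confirm whether suspect requests are crashing radosgw (rather than merely erroring) is the Ceph crash module. A sketch, assuming the crash module is enabled (it is by default in recent releases); `<crash-id>` is a placeholder for an ID taken from the listing:

~~~
ceph crash ls-new              # list crash reports that have not been acknowledged
ceph crash info <crash-id>     # inspect one report's backtrace and metadata
ceph crash archive <crash-id>  # acknowledge the report once triaged
~~~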

Comment 19 errata-xmlrpc 2025-06-26 12:22:11 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775