Description of problem:
Customer is experiencing frequent RGW crashes when they attempt to restore a deleted object version from a versioning-enabled bucket.

Version-Release number of selected component (if applicable):
RHCS 7.1.3
RHCS 6.1.6
Upstream Squid 19.2.1

How reproducible:
Customer is able to consistently reproduce this crash with the following steps.

Steps to Reproduce:
1. Create a bucket with versioning enabled:
   aws-debug s3api create-bucket --bucket lbarbe-debug
   aws-debug s3api put-bucket-versioning --bucket lbarbe-debug --versioning-configuration Status=Enabled
2. Create a file larger than 8M and upload it:
   dd if=/dev/zero of=file bs=9M count=1
   aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"
3. Delete the object:
   aws-debug s3 rm "s3://lbarbe-debug/object_with_%.txt"
4. Retrieve the version that existed before the delete:
   aws-debug s3api list-object-versions --bucket lbarbe-debug --prefix 'object_with_%.txt'
5. Restore the version (versionId= in the URL):
   aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'object_with_%.txt' --copy-source 'lbarbe-debug/object_with_%.txt?versionId=R.F3FaB-tiE0-o-qNswoZwSGBstb79E' --part-number 1 --upload-id xxxx

Actual results:
This results in an RGW crash.

Expected results:
The object is restored correctly.

Additional info:
Customer began receiving these crashes on RHCS 6.1.6 and upgraded the cluster to 7.1.3. The specific crash log output differs between the versions, but the crashes occur just as consistently. They also tested against an upstream 19.2.1 cluster and saw the same crash details as on the RHCS 7.1.3 cluster. Will be attaching a core file along with more log snippets of the crashes.
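For reference, the S3 API expects the bucket/key in the copy source of an UploadPartCopy request to be URL-encoded, so a literal '%' in the key (as in the reproducer above) should be sent on the wire as '%25'. A minimal Python sketch of that client-side encoding, using standard `urllib` percent-encoding semantics rather than RGW's actual parsing code (the bucket and key names are taken from the reproducer):

```python
from urllib.parse import quote, unquote

key = "object_with_%.txt"

# URL-encode the key for the copy source: only the literal '%'
# needs escaping here, becoming '%25'.
encoded_key = quote(key, safe="/")
copy_source = f"lbarbe-debug/{encoded_key}"
print(copy_source)  # lbarbe-debug/object_with_%25.txt

# A single server-side percent-decode then recovers the original key.
assert unquote(encoded_key) == key
```

If the key is sent raw instead, the server-side decode of the '%' character becomes ambiguous, which is why object names containing '%' are a useful probe for this class of bug.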
A log file from a test with an older version of the AWS CLI was attached in relation to KCS https://access.redhat.com/solutions/7109373. However, although the CU did downgrade the AWS CLI version and retry the test, they also mentioned that this is being observed with multiple clients.
We just got some new information from the CU on this. They are able to reproduce this without versioning on an object. Below is their abbreviated reproducer and the crash log for an unversioned object.
~~~
# Create a bucket
aws-debug s3api create-bucket --bucket lbarbe-debug

# Create file > 8M and upload it
dd if=/dev/zero of=file bs=9M count=1
aws-debug s3 cp file "s3://lbarbe-debug/object_with_%.txt"

# Copy file with multipart upload
aws-debug s3api create-multipart-upload --bucket lbarbe-debug --key 'new_object.txt'
aws-debug s3api upload-part-copy --bucket lbarbe-debug --key 'new_object.txt' --copy-source 'lbarbe-debug/object_with_%.txt' --part-number 1 --upload-id "2~aUmf8gH_YdOCwNxyCrk---i8-1eE0v-"
~~~

--> RGW crash backtrace:
~~~
"/lib64/libc.so.6(+0x3e730) [0x7f8d6e517730]",
"/lib64/libc.so.6(+0x8ba6c) [0x7f8d6e564a6c]",
"raise()",
"abort()",
"/lib64/libc.so.6(+0x2875b) [0x7f8d6e50175b]",
"/lib64/libc.so.6(+0x373c6) [0x7f8d6e5103c6]",
"(RGWObjectCtx::set_atomic(rgw_obj const&)+0xf8) [0x55e1c44658a8]",
"/usr/bin/radosgw(+0x4ff208) [0x55e1c4246208]",
"(RGWPutObj::verify_permission(optional_yield)+0x512) [0x55e1c4266de2]",
"(rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0x648) [0x55e1c41b6c58]",
"(process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1002) [0x55e1c41bdeb2]",
"/usr/bin/radosgw(+0xb9cdf1) [0x55e1c48e3df1]",
"/usr/bin/radosgw(+0x385e76) [0x55e1c40cce76]",
"make_fcontext()"
~~~
So the crash appears exactly the same whether the object is versioned or not. I've asked the CU if they can test this in a bucket without versioning enabled to see if that makes a difference.
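The crash surfacing identically with and without versioning is consistent with the '%' in the copy-source key being the trigger rather than anything version-specific. As an illustration of why a raw '%' is hazardous (this shows standard percent-decoding semantics via Python's stdlib, not RGW's code path): if a server percent-decodes a key the client sent unencoded, a key containing a valid escape sequence is silently rewritten, while a bare '%' is not a valid escape at all, leaving the decoder to guess.

```python
from urllib.parse import unquote

# A raw key containing a valid escape sequence is silently rewritten
# by an extra decode pass ('%41' is the hex code for 'A').
assert unquote("file_%41.txt") == "file_A.txt"

# A bare '%' (as in the reproducer's key) is not a valid escape
# sequence; Python's decoder passes it through unchanged, but other
# decoders may reject it or behave unpredictably.
assert unquote("object_with_%.txt") == "object_with_%.txt"
```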
I'll update once they provide those results.
Please ignore previous comment #5. I missed that in that test the CU had created a completely new bucket without versioning; I thought they had simply uploaded another object to the same bucket without specifying versioning on the object.
I followed up with the customer on this issue today just to touch base. I explained that this reaches beyond just Ceph/RGW, as the issue seems to lie more with the AWS S3 protocol itself. Their response was the concern that a bad actor could exploit this to crash their RGW daemons. They are far less concerned with being able to reliably use the '%' character in object names than with ensuring this cannot crash the RGW daemons.
I just wanted to check in again on this for the customer regarding their concern that this could be used as a DoS-style attack by crashing the RGW daemons. Any thoughts on mitigating this particular AWS S3 protocol issue?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2025:9775