Bug 2239173 - [rhcs7.0][rgw][notifications]:with kafka-ack-level=broker, objects deletion with boto3 failed with read timeout on endpoint url and observed multiple delete notifications sent for single object
Summary: [rhcs7.0][rgw][notifications]:with kafka-ack-level=broker, objects deletion w...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0
Assignee: Yuval Lifshitz
QA Contact: Hemanth Sai
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-15 17:50 UTC by Hemanth Sai
Modified: 2023-12-13 15:23 UTC
CC: 5 users

Fixed In Version: ceph-18.2.0-128.el9cp
Doc Type: Bug Fix
Doc Text:
.Multi-delete function notifications work as expected
Previously, due to internal errors, such as a race condition in the code, the Ceph Object Gateway would crash or react unexpectedly when multi-delete operations were performed and notifications were set for bucket deletions. With this fix, notifications for multi-delete operations work as expected.
Clone Of:
Environment:
Last Closed: 2023-12-13 15:23:22 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7476 0 None None None 2023-09-15 17:52:15 UTC
Red Hat Product Errata RHBA-2023:7780 0 None None None 2023-12-13 15:23:25 UTC

Description Hemanth Sai 2023-09-15 17:50:29 UTC
Description of problem:
with kafka-ack-level=broker, object deletion with boto3 fails with a read timeout on the endpoint URL, and multiple delete notifications are observed for a single object.
the first delete notification received for every object has the correct object size, and the repeated ones have object size 0
this issue is seen only with kafka-ack-level=broker and persistent=false

this issue is not seen on rhcs6.1 but is observed on rhcs7.0
pass log for rhcs6.1 : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/17.2.6-136/Weekly/rgw/9/tier-2_rgw_test_bucket_notifications/
failure log on rhcs7.0 : http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-6/Weekly/rgw/4/tier-2_rgw_test_bucket_notifications/

also, if we delete all objects from a bucket not configured with notifications using the boto3 resource, it works fine

moreover, this issue is not seen when objects are deleted recursively using aws-cli:
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 rm s3://notif-bkt6 --recursive
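the aws-cli recursive delete presumably avoids the failure because it issues individual DeleteObject requests, whereas the boto3 resource's objects.delete() sends a single multi-object DeleteObjects request (the "?delete" endpoint that times out in step 8 below). a minimal workaround sketch of the equivalent per-object deletion with boto3, assuming the same user, endpoint, and bucket names used in this report:

import boto3

# Hypothetical workaround sketch: delete objects one at a time with
# DeleteObject instead of the bulk DeleteObjects (POST /<bucket>?delete)
# request that bucket.objects.delete() sends.
s3 = boto3.client(
    "s3",
    aws_access_key_id="abc1",
    aws_secret_access_key="abc1",
    endpoint_url="http://localhost:80",
)

bucket = "notif-bkt6"
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # one DeleteObject call (and therefore one delete notification) per object
        s3.delete_object(Bucket=bucket, Key=obj["Key"])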

Version-Release number of selected component (if applicable):
ceph version 18.2.0-27.el9cp

How reproducible:
always

Steps to Reproduce:
1. deploy a cluster on rhcs7.0 with an rgw daemon
2. create an rgw user
radosgw-admin user create --display-name "user1" --uid user1 --access_key abc1 --secret_key abc1
3. create a bucket
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 mb s3://notif-bkt5
4. create a topic with kafka-ack-level=broker
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 sns create-topic --name=topic_for_delete_testing5  --attributes='{"push-endpoint": "kafka://localhost:9092","kafka-ack-level":"broker", "use-ssl": "false", "verify-ssl": "false"}'
5. put a bucket notification configuration on the bucket
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3api put-bucket-notification-configuration  --bucket notif-bkt5 --notification-configuration='{"TopicConfigurations": [{"Id": "notif_for_delete_testing5", "TopicArn": "arn:aws:sns:shared::topic_for_delete_testing5", "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]}]}'
6. create a random 15KB file
base64 /dev/urandom | head -c 15KB > obj
7. run the below code to upload objects and then delete all of them at once using the boto3 resource
import boto3
import time

bucket='notif-bkt5'

rgw_conn = boto3.resource(
            "s3",
            aws_access_key_id="abc1",
            aws_secret_access_key="abc1",
            endpoint_url="http://localhost:80"
        )

bkt_conn = rgw_conn.Bucket(bucket)

objects_count = 25
print(f"uploading {objects_count} objects in bucket: {bucket}")
for obj_index in range(objects_count):
    obj_conn = bkt_conn.Object(f"prefix1_obj_{obj_index}")
    obj_conn.upload_file('/home/cephuser/obj')

time.sleep(5)

print(f"listing all objects in bucket: {bucket}")
objects_conn = bkt_conn.objects
all_objects = objects_conn.all()
print(f"all objects: {all_objects}")
for obj in all_objects:
    print(f"object_name: {obj.key}")

time.sleep(5)

print(f"deleting all objects in bucket: {bucket}")
response = objects_conn.delete()
print(response)
8. the above code fails at object deletion with a read timeout error:
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://localhost:80/notif-bkt5?delete"
9. received 105 notifications altogether for both put and delete, out of which only 25 are ObjectCreated:Put (which is correct) and the remaining 80 are ObjectRemoved:Delete, i.e. repeated delete notifications for the 25 objects (see the verification sketch after the listing below)
10. also noticed that some objects still remain in the bucket after the failed boto3 deletion:
[cephuser@ceph-pri-hmaheswa-ms-rhcs7-0kgilg-node5 ~]$ AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 ls s3://notif-bkt5
2023-09-15 12:46:07      15000 prefix1_obj_23
2023-09-15 12:46:08      15000 prefix1_obj_24
2023-09-15 12:45:55      15000 prefix1_obj_3
2023-09-15 12:45:55      15000 prefix1_obj_4
2023-09-15 12:45:56      15000 prefix1_obj_5
2023-09-15 12:45:57      15000 prefix1_obj_6
2023-09-15 12:45:57      15000 prefix1_obj_7
2023-09-15 12:45:58      15000 prefix1_obj_8
2023-09-15 12:45:59      15000 prefix1_obj_9
[cephuser@ceph-pri-hmaheswa-ms-rhcs7-0kgilg-node5 ~]$ 
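a minimal sketch for counting the notifications on the Kafka side, assuming the kafka-python package and the topic/broker created in step 4, and assuming the messages carry the usual S3 event record layout (Records, eventName, s3.object.key/size) that the notification counts above refer to:

import json
from collections import Counter

from kafka import KafkaConsumer  # kafka-python

# drain the topic from step 4 and count notifications per (event, key)
consumer = KafkaConsumer(
    "topic_for_delete_testing5",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating after 10s without new messages
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

counts = Counter()
for msg in consumer:
    for record in msg.value.get("Records", []):
        event = record["eventName"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size")
        counts[(event, key)] += 1
        print(event, key, size)

# any ObjectRemoved:Delete entry with a count > 1 is a duplicate delete notification
duplicates = {k: c for k, c in counts.items() if c > 1}
print(f"total notifications: {sum(counts.values())}, duplicates: {duplicates}")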


Actual results:
object deletion with the boto3 resource failed with a read timeout, and repeated delete notifications were seen for each object
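a minimal diagnostic sketch for checking whether the multi-delete eventually completes or the gateway hangs: raise botocore's read timeout and disable client-side retries so a slow response is not resent (the 300-second value is an arbitrary assumption):

import boto3
from botocore.config import Config

# longer read timeout, no automatic retries: distinguishes a slow
# DeleteObjects response from a hang, and avoids retry-driven duplicate requests
cfg = Config(read_timeout=300, connect_timeout=10, retries={"max_attempts": 0})
rgw_conn = boto3.resource(
    "s3",
    aws_access_key_id="abc1",
    aws_secret_access_key="abc1",
    endpoint_url="http://localhost:80",
    config=cfg,
)
print(rgw_conn.Bucket("notif-bkt5").objects.delete())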

Expected results:
object deletion with the boto3 resource succeeds, and only one delete notification is seen for each object

Additional info:
manual testing details are present in this doc: https://docs.google.com/document/d/1S3Pp3XIi8BxrjJ-JoaZzVEV0w3aGGs8ZjhZgnFwesME/edit?usp=sharing

rgw_node: 10.0.207.70
creds: cephuser/cephuser ; root/password

Comment 10 errata-xmlrpc 2023-12-13 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

