Bug 2239173

Summary: [rhcs7.0][rgw][notifications]: with kafka-ack-level=broker, object deletion with boto3 fails with a read timeout on the endpoint URL, and multiple delete notifications are sent for a single object
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Hemanth Sai <hmaheswa>
Component: RGW
Assignee: Yuval Lifshitz <ylifshit>
Status: CLOSED ERRATA
QA Contact: Hemanth Sai <hmaheswa>
Severity: high
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 7.0
CC: akraj, ceph-eng-bugs, cephqe-warriors, tserlin, ylifshit
Target Milestone: ---
Keywords: Automation, Regression
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-18.2.0-128.el9cp
Doc Type: Bug Fix
Doc Text:
.Multi-delete function notifications work as expected
Previously, due to internal errors, such as a race condition in the code, the Ceph Object Gateway would crash or react unexpectedly when multi-delete operations were performed and notifications were configured for bucket deletions. With this fix, notifications for multi-delete operations work as expected.
Last Closed: 2023-12-13 15:23:22 UTC
Type: Bug

Description Hemanth Sai 2023-09-15 17:50:29 UTC
Description of problem:
With kafka-ack-level=broker, object deletion with boto3 fails with a read timeout on the endpoint URL, and multiple delete notifications are sent for a single object.
The first delete notification received for each object has the correct object size; the repeated ones have object size 0.
This issue is seen only with kafka-ack-level=broker and persistent=false.

This issue is not seen on RHCS 6.1; it is observed on RHCS 7.0.
Pass log for RHCS 6.1: http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/17.2.6-136/Weekly/rgw/9/tier-2_rgw_test_bucket_notifications/
Failure log on RHCS 7.0: http://magna002.ceph.redhat.com/cephci-jenkins/test-runs/18.2.0-6/Weekly/rgw/4/tier-2_rgw_test_bucket_notifications/

Also, deleting all objects from a bucket that is not configured with notifications works fine with the boto3 resource.

Moreover, this issue is not seen when deleting objects recursively using the AWS CLI:
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 rm s3://notif-bkt6 --recursive
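
For reference, the boto3 resource call in step 7 of the reproduction steps issues S3 multi-object delete requests (POST /notif-bkt5?delete), which matches the URL seen in the read-timeout error below. The following is a minimal sketch of the equivalent explicit call with the boto3 client, reusing the bucket name and credentials from the reproduction steps; the explicit client form is illustrative only and not part of the original test:

import boto3

# Explicit multi-object delete (the DeleteObjects API), which is what
# Bucket.objects.all().delete() issues under the hood in batches.
client = boto3.client(
    "s3",
    aws_access_key_id="abc1",
    aws_secret_access_key="abc1",
    endpoint_url="http://localhost:80",
)

listing = client.list_objects_v2(Bucket="notif-bkt5")
keys = [{"Key": obj["Key"]} for obj in listing.get("Contents", [])]
if keys:  # DeleteObjects requires a non-empty object list
    response = client.delete_objects(Bucket="notif-bkt5", Delete={"Objects": keys})
    print(response)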

Version-Release number of selected component (if applicable):
ceph version 18.2.0-27.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy a cluster on RHCS 7.0 with an RGW daemon.
2. Create an RGW user:
radosgw-admin user create --display-name "user1" --uid user1 --access_key abc1 --secret_key abc1
3. Create a bucket:
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 mb s3://notif-bkt5
4. Create a topic with kafka-ack-level=broker:
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 sns create-topic --name=topic_for_delete_testing5  --attributes='{"push-endpoint": "kafka://localhost:9092","kafka-ack-level":"broker", "use-ssl": "false", "verify-ssl": "false"}'
5. Configure bucket notifications for the bucket:
AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3api put-bucket-notification-configuration  --bucket notif-bkt5 --notification-configuration='{"TopicConfigurations": [{"Id": "notif_for_delete_testing5", "TopicArn": "arn:aws:sns:shared::topic_for_delete_testing5", "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]}]}'
6. Create a random 15 KB file:
base64 /dev/urandom | head -c 15KB > obj
7. Run the code below to upload objects and then delete them all at once using the boto3 resource:
import boto3
import time

bucket='notif-bkt5'

rgw_conn = boto3.resource(
    "s3",
    aws_access_key_id="abc1",
    aws_secret_access_key="abc1",
    endpoint_url="http://localhost:80",
)

bkt_conn = rgw_conn.Bucket(bucket)

objects_count = 25
print(f"uploading {objects_count} objects in bucket: {bucket}")
for obj_index in range(objects_count):
    obj_conn = bkt_conn.Object(f"prefix1_obj_{obj_index}")
    obj_conn.upload_file('/home/cephuser/obj')

time.sleep(5)

print(f"listing all objects in bucket: {bucket}")
objects_conn = bkt_conn.objects
all_objects = objects_conn.all()
print(f"all objects: {all_objects}")
for obj in all_objects:
    print(f"object_name: {obj.key}")

time.sleep(5)

print(f"deleting all objects in bucket: {bucket}")
response = objects_conn.delete()
print(response)
8. The above code fails at object deletion with a read timeout error:
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "http://localhost:80/notif-bkt5?delete"
9. 105 notifications were received altogether for both put and delete, of which only 25 are ObjectCreated:Put (the correct count for 25 uploads) and the remaining 80 are ObjectRemoved:Delete; see the Kafka consumer sketch after this list for one way to count them.
10. Also noticed that some objects still remain in the bucket after the failed boto3 deletion:
[cephuser@ceph-pri-hmaheswa-ms-rhcs7-0kgilg-node5 ~]$ AWS_ACCESS_KEY_ID=abc1 AWS_SECRET_ACCESS_KEY=abc1 aws --endpoint-url http://localhost:80 s3 ls s3://notif-bkt5
2023-09-15 12:46:07      15000 prefix1_obj_23
2023-09-15 12:46:08      15000 prefix1_obj_24
2023-09-15 12:45:55      15000 prefix1_obj_3
2023-09-15 12:45:55      15000 prefix1_obj_4
2023-09-15 12:45:56      15000 prefix1_obj_5
2023-09-15 12:45:57      15000 prefix1_obj_6
2023-09-15 12:45:57      15000 prefix1_obj_7
2023-09-15 12:45:58      15000 prefix1_obj_8
2023-09-15 12:45:59      15000 prefix1_obj_9
[cephuser@ceph-pri-hmaheswa-ms-rhcs7-0kgilg-node5 ~]$ 
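
The duplicate delete notifications can be counted on the Kafka side. Below is a minimal sketch assuming the kafka-python package and that the broker topic name matches the SNS topic name topic_for_delete_testing5 on localhost:9092 (both taken from the create-topic step above); the consumer script itself is not part of the original test tooling:

from collections import Counter
import json

from kafka import KafkaConsumer  # assumes the kafka-python package is installed

# Read all bucket-notification events from the topic and count them per
# (event name, object key); any ObjectRemoved:Delete pair with count > 1
# is a duplicate delete notification.
consumer = KafkaConsumer(
    "topic_for_delete_testing5",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop after 10 s without new messages
)

counts = Counter()
for message in consumer:
    notification = json.loads(message.value)
    for record in notification.get("Records", []):
        event = record["eventName"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size")
        counts[(event, key)] += 1
        print(f"{event} {key} size={size}")

for (event, key), count in sorted(counts.items()):
    print(f"{count:3d}  {event}  {key}")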


Actual results:
Object deletion with the boto3 resource fails with a read timeout, and repeated delete notifications are seen for each object.

Expected results:
Object deletion with the boto3 resource succeeds, and only one delete notification is seen for each object.

Additional info:
manual testing details are present in this doc: https://docs.google.com/document/d/1S3Pp3XIi8BxrjJ-JoaZzVEV0w3aGGs8ZjhZgnFwesME/edit?usp=sharing

rgw_node: 10.0.207.70
creds: cephuser/cephuser ; root/password

Comment 10 errata-xmlrpc 2023-12-13 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780