Bug 2256668
| Summary: | [RGW MS]: with rgw_data_notify_interval_msec=0, boto3 objects deletion is failing with read timeout after executing bucket radoslist | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Hemanth Sai <hmaheswa> |
| Component: | RGW-Multisite | Assignee: | J. Eric Ivancich <ivancich> |
| Status: | CLOSED DUPLICATE | QA Contact: | Madhavi Kasturi <mkasturi> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.0 | CC: | ceph-eng-bugs, cephqe-warriors, ivancich, smanjara |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Target Release: | 7.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2024-03-22 22:54:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I tried examining magna002.ceph.redhat.com:/ceph/ceph-qe-logs/Hemanth_Sai/objects_deletion_timeout_listing_bz/, but the objects_deletion_timeout_listing_bz subdirectory was not present. Maybe it was deleted in the interim?

From the script it appears `radoslist` fully completes before the listing and deletion. Is the implication that if you remove the `radoslist` step there is no issue? And that this A/B testing (with / without radoslist) is consistent over multiple runs?

Have you tested this in a non-multisite environment? There are so many things going on that it's hard to know what's going on. Generally it's best to pull back enabled/used features to find the simplest configuration that will cause the failure.

radoslist is a read-only tool, so this is baffling.

Eric

I don't see the connection between 'rgw_data_notify_interval_msec' and running radoslist. The conf option is used to turn off broadcasting of new entries in datalog changes to peer zones. Even without data notification, the sync mechanism will work. I'm not clear how multisite is involved at all. From the script, it appears that you are creating objects and running radoslist on the same zone.

Hi Hemanth,

Thanks for the update. That analysis makes sense. I think we should close this either as DUPLICATE or NOTABUG. What do you think?

Eric

*** This bug has been marked as a duplicate of bug 2253015 ***
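The A/B question raised in the comments (does the failure appear only when the `radoslist` step runs, and is that consistent over multiple runs?) could be checked with a small harness like the one below. This is only a sketch: `run_ab_trials` and the `trial` callable are hypothetical names that do not appear in the original script, and `trial` is assumed to wrap one full create/upload/(radoslist)/list/delete cycle and return True when the final deletion succeeds.

```python
from collections import Counter
from typing import Callable, Dict


def run_ab_trials(trial: Callable[[bool], bool], runs: int = 5) -> Dict[str, Counter]:
    """Run `trial` `runs` times with and `runs` times without radoslist.

    `trial(with_radoslist)` is a hypothetical callable supplied by the tester.
    An exception raised by the trial (for example a read timeout) counts as
    a failure.
    """
    results = {"with_radoslist": Counter(), "without_radoslist": Counter()}
    for _ in range(runs):
        # Interleave the two variants so that cluster drift over time
        # affects both of them roughly equally.
        for with_radoslist in (True, False):
            label = "with_radoslist" if with_radoslist else "without_radoslist"
            try:
                ok = bool(trial(with_radoslist))
            except Exception:
                ok = False
            results[label]["pass" if ok else "fail"] += 1
    return results


if __name__ == "__main__":
    # Stand-in trial that fails only when radoslist ran; replace it with a
    # function that performs the real cycle against the cluster.
    def fake_trial(with_radoslist: bool) -> bool:
        return not with_radoslist

    print(run_ab_trials(fake_trial, runs=3))
```

If the pass/fail counts for the two variants come out the same on the real cluster, the `radoslist` step is unlikely to be the trigger.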
Description of problem:
With rgw_data_notify_interval_msec=0, boto3 object deletion fails with a read timeout after executing bucket radoslist and waiting for a few minutes.

This test passed on Pacific and Quincy; it is failing on Reef.

Created a sample script based on the automation script:

    import boto3
    import time
    import configparser
    import datetime
    import hashlib
    import json
    import logging
    import os
    import random
    import shutil
    import socket
    import string
    import subprocess


    def exec_shell_cmd(cmd, debug_info=False, return_err=False):
        # Run a shell command; return stdout (and stderr if debug_info is set),
        # stderr on failure if return_err is set, or False on any exception.
        try:
            print("executing cmd: %s" % cmd)
            pr = subprocess.Popen(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=False,
                shell=True,
            )
            out, err = pr.communicate()
            out = out.decode("utf-8", errors="ignore")
            err = err.decode("utf-8", errors="ignore")
            if pr.returncode == 0:
                print("cmd executed")
                if out is not None:
                    print(out)
                    if debug_info:
                        print(err)
                        return out, err
                    else:
                        return out
            else:
                if return_err:
                    return err
                raise Exception("error: %s \nreturncode: %s" % (err, pr.returncode))
        except Exception as e:
            print("cmd execution failed")
            print(e)
            return False


    bucket = 'user1bkt3'
    rgw_conn = boto3.resource(
        "s3",
        aws_access_key_id="abc2",
        aws_secret_access_key="abc2",
        endpoint_url="http://10.0.210.220:80",
        verify=False
    )
    bkt_conn = rgw_conn.Bucket(bucket)

    print(f"creating bucket {bucket}")
    resp = bkt_conn.create()
    print(f"bucket creation response: {resp}")
    time.sleep(15)

    print("creating local file")
    resp = exec_shell_cmd("base64 /dev/urandom | head -c 15KB > /home/cephuser/obj")
    print(resp)

    # Upload objects under five pseudo-directory prefixes.
    objects_count = 100
    print(f"uploading {objects_count} objects in bucket: {bucket}")
    for dir_index in range(5):
        for obj_index in range(objects_count):
            obj_conn = bkt_conn.Object(f"dir_{dir_index}/obj_{obj_index}")
            obj_conn.upload_file('/home/cephuser/obj')
    time.sleep(3)

    print("executing radosgw-admin bucket radoslist")
    resp = exec_shell_cmd(f"radosgw-admin bucket radoslist --bucket {bucket}")
    print(resp)

    print("executing ceph config get mon rgw_bucket_index_max_aio")
    resp = exec_shell_cmd("ceph config get mon rgw_bucket_index_max_aio")
    print(resp)

    print("sleeping for 30 seconds")
    time.sleep(30)

    print("executing radosgw-admin bucket stats")
    resp = exec_shell_cmd(f"radosgw-admin bucket stats --bucket {bucket}")
    print(resp)

    print(f"listing all objects in bucket: {bucket}")
    objects_conn = bkt_conn.objects
    all_objects = objects_conn.all()
    print(f"all objects: {all_objects}")
    for obj in all_objects:
        print(f"object_name: {obj.key}")
    time.sleep(5)

    # This is the call that fails with a read timeout.
    print(f"deleting all objects in bucket: {bucket}")
    response = objects_conn.delete()
    print(response)

Version-Release number of selected component (if applicable):
ceph version 18.2.0-131.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy RHCS 7.0 primary and secondary clusters.
2. Configure multisite.
3. Create an RGW user:
   radosgw-admin user create --uid=hmaheswa --display-name hmaheswa --access-key abc2 --secret abc2
4. Set rgw_data_notify_interval_msec to 0:
   ceph config set client.rgw.shared.pri.ceph-pri-hsm-7x-listing-ik5svv-node5.vwlyev rgw_data_notify_interval_msec 0
5. Modify the endpoint URL in the script and execute it, or perform the script's steps (listed below) using boto3.
6. Create a bucket.
7. Upload objects with pseudo-directory names.
8. Execute radosgw-admin bucket radoslist --bucket bucketname
9. Wait for 30 seconds.
10. Execute radosgw-admin bucket stats --bucket {bucket}
11. List all the objects.
12. Delete all objects at once using bucket.objects.delete()

Actual results:
Object deletion fails with a read timeout.

Expected results:
Object deletion succeeds.

Additional info:
primary site rgw: 10.0.210.220
creds: cephuser/cephuser ; root/passwd

debug_rgw 20 is enabled; logs and testing output are present at http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/objects_deletion_timeout_listing_bz/
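To narrow down where the read timeout in step 12 occurs, one option is to replace the single `bucket.objects.delete()` call with explicit, smaller `delete_objects` batches and time each one. The sketch below assumes this approach is acceptable for the test. The helper names (`chunked`, `delete_payload`, `delete_in_batches`, `make_bucket`) and the batch size and timeout values are illustrative and not part of the original script. It relies on boto3's `Bucket.delete_objects` and botocore's `Config(read_timeout=..., retries=...)`, with retries disabled so that a slow request shows up as one long call rather than as several retried ones.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def delete_payload(batch: List[str]) -> dict:
    """Build the `Delete` argument for a multi-object delete request."""
    return {"Objects": [{"Key": key} for key in batch], "Quiet": True}


def delete_in_batches(bucket, keys: Iterable[str], batch_size: int = 100) -> List[Tuple[int, float]]:
    """Delete `keys` in batches; return (batch length, seconds taken) per batch.

    `bucket` is expected to behave like a boto3 `s3.Bucket` resource.
    """
    timings = []
    for batch in chunked(keys, batch_size):
        start = time.monotonic()
        bucket.delete_objects(Delete=delete_payload(batch))
        timings.append((len(batch), time.monotonic() - start))
    return timings


def make_bucket(endpoint_url: str, access_key: str, secret_key: str,
                bucket_name: str, read_timeout: int = 300):
    """Create a Bucket resource with a longer read timeout and no retries."""
    # Imported here so the helpers above can be used without boto3 installed.
    import boto3
    from botocore.config import Config

    cfg = Config(read_timeout=read_timeout, retries={"max_attempts": 0})
    s3 = boto3.resource(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        endpoint_url=endpoint_url,
        config=cfg,
    )
    return s3.Bucket(bucket_name)
```

For example, `delete_in_batches(bkt, [o.key for o in bkt.objects.all()])` with `bkt` returned by `make_bucket(...)` would show whether every batch is slow or only specific ones, which can then be matched against the debug_rgw 20 logs.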