
Bug 2256668

Summary: [RGW MS]: with rgw_data_notify_interval_msec=0, boto3 objects deletion is failing with read timeout after executing bucket radoslist
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RGW-Multisite
Version: 7.0
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Reporter: Hemanth Sai <hmaheswa>
Assignee: J. Eric Ivancich <ivancich>
QA Contact: Madhavi Kasturi <mkasturi>
CC: ceph-eng-bugs, cephqe-warriors, ivancich, smanjara
Keywords: Automation, Regression
Type: Bug
Last Closed: 2024-03-22 22:54:58 UTC

Description Hemanth Sai 2024-01-03 17:39:48 UTC
Description of problem:
With rgw_data_notify_interval_msec=0, boto3 object deletion fails with a read timeout after executing bucket radoslist and waiting for a few minutes.

This test passed on Pacific and Quincy; it fails on Reef.

I created a sample script based on the automation script:

import subprocess
import time

import boto3


def exec_shell_cmd(cmd, debug_info=False, return_err=False):
    """Run a shell command.

    Returns stdout on success, or (stdout, stderr) when debug_info is set.
    On a non-zero exit, returns stderr if return_err is set, otherwise False.
    """
    try:
        print("executing cmd: %s" % cmd)
        pr = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=False,
            shell=True,
        )
        out, err = pr.communicate()
        out = out.decode("utf-8", errors="ignore")
        err = err.decode("utf-8", errors="ignore")
        if pr.returncode == 0:
            print("cmd executed")
            print(out)
            if debug_info:
                print(err)
                return out, err
            return out
        if return_err:
            return err
        raise Exception("error: %s \nreturncode: %s" % (err, pr.returncode))
    except Exception as e:
        # Failures are reported and signalled with False rather than re-raised.
        print("cmd execution failed")
        print(e)
        return False

bucket = "user1bkt3"

rgw_conn = boto3.resource(
    "s3",
    aws_access_key_id="abc2",
    aws_secret_access_key="abc2",
    endpoint_url="http://10.0.210.220:80",
    verify=False,
)

bkt_conn = rgw_conn.Bucket(bucket)
print(f"creating bucket {bucket}")
resp = bkt_conn.create()
print(f"bucket creation response: {resp}")
time.sleep(15)

print("creating local file")
resp = exec_shell_cmd("base64 /dev/urandom | head -c 15KB > /home/cephuser/obj")
print(resp)

objects_count = 100
print(f"uploading {objects_count} objects in bucket: {bucket}")
for dir_index in range(5):
    for obj_index in range(objects_count):
        obj_conn = bkt_conn.Object(f"dir_{dir_index}/obj_{obj_index}")
        obj_conn.upload_file('/home/cephuser/obj')
time.sleep(3)


print("executing radosgw-admin bucket radoslist")
resp = exec_shell_cmd(f"radosgw-admin bucket radoslist --bucket {bucket}")
print(resp)

print("executing ceph config get mon rgw_bucket_index_max_aio")
resp = exec_shell_cmd("ceph config get mon rgw_bucket_index_max_aio")
print(resp)

print("sleeping for 30 seconds")
time.sleep(30)

print("executing radosgw-admin bucket stats")
resp = exec_shell_cmd(f"radosgw-admin bucket stats --bucket {bucket}")
print(resp)

print(f"listing all objects in bucket: {bucket}")
objects_conn = bkt_conn.objects
all_objects = objects_conn.all()
print(f"all objects: {all_objects}")
for obj in all_objects:
    print(f"object_name: {obj.key}")
time.sleep(5)

print(f"deleting all objects in bucket: {bucket}")
response = objects_conn.delete()
print(response)





Version-Release number of selected component (if applicable):
ceph version 18.2.0-131.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy RHCS 7.0 primary and secondary clusters.
2. Configure multisite.
3. Create an RGW user:
radosgw-admin user create --uid=hmaheswa --display-name hmaheswa --access-key abc2 --secret abc2
4. Set rgw_data_notify_interval_msec to 0:
ceph config set client.rgw.shared.pri.ceph-pri-hsm-7x-listing-ik5svv-node5.vwlyev rgw_data_notify_interval_msec 0
5. Modify endpoint_url in the script and execute it, or perform the script's steps (6 to 12 below) manually using boto3.
6. Create a bucket.
7. Upload objects with pseudo-directory names.
8. Execute radosgw-admin bucket radoslist --bucket bucketname
9. Wait for 30 seconds.
10. Execute radosgw-admin bucket stats --bucket {bucket}
11. List all the objects.
12. Delete all objects at once using bucket.objects.delete().
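
Before running steps 6 to 12 it may help to confirm that the option from step 4 actually took effect on the RGW daemon. The following is a minimal sketch, not part of the original automation: the helper names (config_set_cmd, config_get_cmd, value_matches) are hypothetical, and it assumes that `ceph config get <who> <option>` prints only the option's value, which is how the script already queries rgw_bucket_index_max_aio.

```python
import shlex

OPTION = "rgw_data_notify_interval_msec"


def config_set_cmd(daemon, value, option=OPTION):
    """Build the `ceph config set` command used in step 4."""
    return ["ceph", "config", "set", daemon, option, str(value)]


def config_get_cmd(daemon, option=OPTION):
    """Build the matching `ceph config get` command to read the value back."""
    return ["ceph", "config", "get", daemon, option]


def value_matches(get_output, expected):
    """Compare the text printed by `ceph config get` with the expected value.

    Assumption: the command prints just the value followed by a newline.
    """
    return get_output.strip() == str(expected)


if __name__ == "__main__":
    # Daemon name taken from step 4 of this report; adjust for your cluster.
    daemon = "client.rgw.shared.pri.ceph-pri-hsm-7x-listing-ik5svv-node5.vwlyev"
    print(shlex.join(config_set_cmd(daemon, 0)))
    print(shlex.join(config_get_cmd(daemon)))
```

The printed command lines can be passed to exec_shell_cmd from the script, and its return value fed to value_matches(output, 0).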

Actual results:
Object deletion fails with a read timeout.

Expected results:
Object deletion succeeds.
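
To narrow down where the timeout occurs, the bulk delete can be split into smaller, individually timed DeleteObjects calls. This is a sketch only and is not what the automation does: delete_in_batches and its batch_size parameter are hypothetical, and `client` is assumed to be any object exposing boto3's S3 client method delete_objects(Bucket=..., Delete=...).

```python
import time


def delete_in_batches(client, bucket, keys, batch_size=100):
    """Delete keys in fixed-size DeleteObjects calls, timing each call.

    Returns a list of (number_of_keys, seconds, error) tuples, one per batch;
    error is None on success, otherwise the exception that was raised.
    S3 allows at most 1000 keys per DeleteObjects request.
    """
    results = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        payload = {"Objects": [{"Key": k} for k in batch], "Quiet": True}
        began = time.monotonic()
        error = None
        try:
            client.delete_objects(Bucket=bucket, Delete=payload)
        except Exception as exc:  # for example, a botocore read timeout
            error = exc
        results.append((len(batch), time.monotonic() - began, error))
    return results
```

With the objects from the script above, this could be called as delete_in_batches(rgw_conn.meta.client, bucket, [obj.key for obj in bkt_conn.objects.all()]). The per-batch timings show whether every request stalls or only the first one after radoslist.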

Additional info:
primary site rgw: 10.0.210.220
creds: cephuser/cephuser ; root/passwd

Logs with debug_rgw=20 enabled and the test output are available at http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/objects_deletion_timeout_listing_bz/

Comment 2 J. Eric Ivancich 2024-03-20 16:47:43 UTC
I tried examining magna002.ceph.redhat.com:/ceph/ceph-qe-logs/Hemanth_Sai/objects_deletion_timeout_listing_bz/, but the objects_deletion_timeout_listing_bz subdirectory was not present. Maybe it was deleted in the interim?

From the script it appears `radoslist` fully completes before the listing and deletion. Is the implication that if you remove the `radoslist` step there is no issue? And that this A/B testing (with / without radoslist) is consistent over multiple runs?

Have you tested this in a non-multisite environment?

There are so many things going on that it's hard to know what's going on. Generally it's best to pull back enabled/used features to find the simplest configuration that will cause the failure.

radoslist is a read-only tool, so this is baffling.

Eric
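
The with/without radoslist comparison asked about above could be structured roughly as follows. This is a minimal sketch under stated assumptions: ab_runs is a hypothetical helper, and `scenario` is a caller-supplied callable that performs one full create/upload/(radoslist)/list/delete cycle and returns True if the deletion succeeded.

```python
from collections import Counter


def ab_runs(scenario, runs=5):
    """Run scenario(with_radoslist) `runs` times per variant and tally outcomes.

    Returns {True: Counter, False: Counter}, keyed by whether radoslist was
    executed, each counting "pass" and "fail" results. An exception raised
    by the scenario (such as a read timeout) is counted as a failure.
    """
    tally = {True: Counter(), False: Counter()}
    for with_radoslist in (True, False):
        for _ in range(runs):
            try:
                ok = bool(scenario(with_radoslist))
            except Exception:
                ok = False
            tally[with_radoslist]["pass" if ok else "fail"] += 1
    return tally
```

If the tallies for the two variants do not differ consistently over several runs, radoslist is unlikely to be the trigger.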

Comment 3 shilpa 2024-03-21 15:11:33 UTC
I don't see the connection between 'rgw_data_notify_interval_msec' and running radoslist. The conf option is used to turn off broadcasting of new datalog entries to peer zones.
Even without data notification, the sync mechanism will work. I'm not clear how multisite is involved at all. From the script, it appears that you are creating objects and running radoslist on the same zone.

Comment 5 J. Eric Ivancich 2024-03-21 16:21:04 UTC
Hi Hemanth,

Thanks for the update. That analysis makes sense. I think we should close this either as DUPLICATE or NOTABUG. What do you think?

Eric

Comment 7 J. Eric Ivancich 2024-03-22 22:54:58 UTC

*** This bug has been marked as a duplicate of bug 2253015 ***