Bug 2423160

Summary: [RGW-MS]: Incremental sync of bucket 'testbucket3' (one of 5 buckets) is stuck after a site outage and ceph service restarts
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vidushi Mishra <vimishra>
Component: RGW-Multisite
Assignee: Adam C. Emerson <aemerson>
Status: CLOSED ERRATA
QA Contact: Vidushi Mishra <vimishra>
Severity: high
Docs Contact:
Priority: unspecified
Version: 9.0
CC: aemerson, ceph-eng-bugs, cephqe-warriors, tserlin
Target Milestone: ---
Target Release: 9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-20.1.0-136
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2026-01-29 07:04:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vidushi Mishra 2025-12-17 13:05:30 UTC
Description of problem:


In an RGW Multisite setup with bidirectional incremental sync, we created 5 buckets (testbucket1…testbucket5) and ran a bidirectional workload of 40M objects per bucket: 20M uploaded from each site, so each bucket should end up with a total of 40M objects in both zones.


After the initial uploads completed and sync was progressing, a lab outage occurred, and both sites went down (all Ceph services, including RGWs). 
When the sites came back up, we restarted mon/mgr/osd/rgw daemons on both zones, which triggered sync to resume.

Post-recovery, 4 buckets fully synced to 40M objects, but 'testbucket3' on the secondary zone did not; it stayed stuck at ~29M objects synced from the primary.

RGW sync logs for 'testbucket3' repeatedly show a sync state/generation mismatch ("requested sync of future generation" retries), and on the secondary zone, 'radosgw-admin sync error list' shows recurring errors for testbucket3, including "failed to sync bucket instance: Invalid argument" and object sync failures with an "Unknown error 2200" code.
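
For reference, a minimal sketch of how the stuck state can be checked from the secondary zone (commands assume admin keyring access on a cluster node; the bucket name is taken from this report, output omitted here):

  # radosgw-admin sync status                               (overall metadata/data sync state)
  # radosgw-admin bucket sync status --bucket=testbucket3   (per-shard sync state for the stuck bucket)
  # radosgw-admin bucket stats --bucket=testbucket3         (local object count vs. the expected 40M)
  # radosgw-admin sync error list                           (recurring sync errors, pasted further below)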

Version-Release number of selected component (if applicable):
ceph version 20.1.0-125.el9cp 

How reproducible:
Observed once in this environment so far.


Environment:
--------------

- Ceph RGW Multisite (2 zones):
- RGW topology (per zone): 6 RGWs total
  3 RGWs dedicated to sync
  3 RGWs dedicated to S3 IO/workload
- Buckets: testbucket1, testbucket2, testbucket3, testbucket4, testbucket5
- Workload scale: 40M objects per bucket
  20M objects uploaded from primary zone
  20M objects uploaded from the secondary zone
- Unexpected failure event: full lab outage (both sites down), followed by restart of all Ceph daemons on both zones (mon/mgr/osd/rgw)

Observed issue bucket: testbucket3 (stuck at ~29M objects synced from primary)
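
As an aside on the topology above, splitting RGWs into sync-dedicated and IO-dedicated gateways is typically done by disabling the sync thread on the client-facing instances; a minimal sketch, with a placeholder instance name rather than the actual names used in this lab:

  # ceph config set client.rgw.<io-rgw-instance> rgw_run_sync_thread false   (client-facing RGWs do not run sync)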

Steps to Reproduce:
---------------------

1. On an RGW multisite setup with 6 RGW daemons per zone, create 5 buckets: testbucket{1..5}

2. Upload 20M objects per bucket from the primary and 20M objects per bucket from the secondary (expect 40M objects/bucket after sync).

3. An unexpected full outage of both sites occurred (all Ceph services/RGWs down); bring both sites back up and restart mon/mgr/osd/rgw on both zones.

4. Observe that all buckets sync to 40M objects except testbucket3, which remains stuck at ~29M objects and shows sync errors (see the verification sketch after this list).
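
A minimal sketch of the verification in step 4, run on each zone (the expected count per the workload above is 40M objects per bucket; the shell loop is illustrative):

  # for b in testbucket{1..5}; do radosgw-admin bucket stats --bucket=$b | grep -w num_objects; done
  # radosgw-admin bucket sync status --bucket=testbucket3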

Actual results:

After the lab outage, all buckets converge to 40M objects except testbucket3, which stays stuck at ~29M and never reaches 40M.

Expected results:

After both sites recover and services restart, all buckets (testbucket1–5) should sync to 40M objects each on both zones.

Additional info:
1. RGW sync log errors (output below)
2. radosgw-admin sync error list output from the secondary zone (below)



Sync RGW logs (with debug-rgw and debug-ms enabled)
==================================================
status.7262ac31-8a55-4607-bb52-cecaea67d13a:testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2 [call out=48b,read 0~156 out=156b] v0'0 uv2899154 ondisk = 0) ==== 304+0+204 (crc 0 0 0) 0x55f06e180280 con 0x55f0568ef000
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f05e470800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648a00:op=0x55f0598c2000:15RGWSyncBucketCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1]: sync status for source bucket: incremental. lease is: not taken. stop indications is: 0
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1]: ERROR: requested sync of future generation 2 > 1, returning -11 for later retry
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648a00:op=0x55f0598c2000:15RGWSyncBucketCR: operate() returned r=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 15 stack 0x55f066648a00 end
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: stack->operate() returned ret=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 20 run: stack=0x55f066648a00 is done
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f05e470800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648b40:op=0x55f060dc6100:25RGWRunBucketSourcesSyncCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 collect(): s=0x55f066648b40 stack=0x55f066648a00 encountered error (r=-11), skipping next stacks
2025-12-17T08:34:02.198+0000 7f9550daf640 10 collect() returned ret=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]: a sync operation returned error: -11
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f06e74a000:op=0x55f0597bd000:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f07b79da40:op=0x55f06a8b5800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f06bdbc000:15RGWSyncBucketCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):257:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257]: sync status for source bucket: incremental. lease is: not taken. stop indications is: 0
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):257:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257]: ERROR: requested sync of future generation 2 > 1, returning -11 for later retry
2025-12-17T08:34:02.198+0000 7f9672ff3640  1 -- 10.64.1.24:0/61706117 <== osd.7 v2:10.64.1.33:6808/1971910789 4343686 ==== osd_op_reply(108630640 bucket.full-sync-status.7262ac31-8a55-4607-bb52-cecaea67d13a:testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2 [call out=48b,read 0~156 out=156b] v0'0 uv2899154 ondisk = 0) ==== 304+0+204 (crc 0 0 0) 0x55f06e180280 con 0x55f0568ef000
"ceph-client.rgw.india.sync.ceph24.yqckqc.log" 274509L, 52188461B



radosgw-admin sync error list output
====================================

[root@ceph24 cede6b96-c809-11f0-8018-04320148dc30]# radosgw-admin sync error list 
[
    {
        "shard_id": 0,
        "entries": [
            {
                "id": "1_1765555168.282990_1131705.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:83[1]",
                "timestamp": "2025-12-12T15:59:28.282990Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },
            {
                "id": "1_1765557329.882086_1134561.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:83[1]",
                "timestamp": "2025-12-12T16:35:29.882086Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },
            {
                "id": "1_1765557844.271406_1135397.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:85/priobject1669629",
                "timestamp": "2025-12-12T16:44:04.271406Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 2200,
                    "message": "failed to sync object(2200) Unknown error 2200"
                }
            },
            {
                "id": "1_1765558046.282638_1135683.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:84/priobject15602675",
                "timestamp": "2025-12-12T16:47:26.282638Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 2200,
                    "message": "failed to sync object(2200) Unknown error 2200"
                }
            },
            {
                "id": "1_1765561407.004026_1140490.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:47[1]",
                "timestamp": "2025-12-12T17:43:27.004026Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },

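For completeness, a minimal sketch of the commands typically used on the secondary zone to re-drive sync for a single bucket while errors like the above are being investigated (illustrative only; not verified as a workaround for this bug):

  # radosgw-admin bucket sync run --bucket=testbucket3       (re-run sync for just this bucket)
  # radosgw-admin sync error trim                            (trim old entries from the sync error log)
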
Comment 14 errata-xmlrpc 2026-01-29 07:04:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536