Bug 2423160
| Summary: | [RGW-MS]: Incremental sync of a bucket 'testbucket3' out of 5 buckets, is stuck after a site outage and ceph service restarts | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vidushi Mishra <vimishra> |
| Component: | RGW-Multisite | Assignee: | Adam C. Emerson <aemerson> |
| Status: | CLOSED ERRATA | QA Contact: | Vidushi Mishra <vimishra> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.0 | CC: | aemerson, ceph-eng-bugs, cephqe-warriors, tserlin |
| Target Milestone: | --- | | |
| Target Release: | 9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-20.1.0-136 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2026-01-29 07:04:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2026:1536
Description of problem:
In an RGW multisite setup with bidirectional incremental sync, we created 5 buckets (testbucket1…testbucket5) and ran a bi-directional workload of 40M objects per bucket: 20M uploaded from each site, so each bucket should end up with 40M objects on both zones. After the initial uploads completed and sync was progressing, a lab outage took both sites down (all Ceph services, including RGWs). When the sites came back up, we restarted the mon/mgr/osd/rgw daemons on both zones and sync resumed. Post-recovery, 4 buckets fully synced to 40M objects, but 'testbucket3' on the secondary zone did not, and stayed stuck at ~29M objects synced from the primary.

RGW sync logs for 'testbucket3' repeatedly show a sync state/generation mismatch ("requested sync of future generation" retries), and on the secondary zone, 'radosgw-admin sync error list' shows recurring errors for testbucket3, including "failed to sync bucket instance: Invalid argument" and object sync failures with an "Unknown error 2200" code.

Version-Release number of selected component (if applicable):
ceph version 20.1.0-125.el9cp

How reproducible:
Observed once, in this environment.

Environment:
--------------
- Ceph RGW multisite (2 zones)
- RGW topology (per zone): 6 RGWs total
  - 3 RGWs dedicated to sync
  - 3 RGWs dedicated to S3 IO/workload
- Buckets: testbucket1, testbucket2, testbucket3, testbucket4, testbucket5
- Workload scale: 40M objects per bucket
  - 20M objects uploaded from the primary zone
  - 20M objects uploaded from the secondary zone
- Unexpected failure event: full lab outage (both sites down), followed by restart of all Ceph daemons on both zones (mon/mgr/osd/rgw)
- Observed issue bucket: testbucket3 (stuck at ~29M objects synced from primary)

Steps to Reproduce:
---------------------
1. On an RGW multisite setup with 6 RGW daemons per zone, create 5 buckets testbucket{1..5}.
2. Upload 20M objects per bucket from the primary zone and 20M objects per bucket from the secondary zone (expect 40M objects/bucket after sync).
3. Hit an unexpected full outage of both sites (all Ceph services/RGWs down), then bring both sites back and restart mon/mgr/osd/rgw on both zones.
4. Observe that all buckets sync to 40M objects except testbucket3, which remains stuck at ~29M objects and keeps logging sync errors.

Actual results:
After the lab outage, all buckets converge except testbucket3, which stays stuck at ~29M objects and never reaches 40M.

Expected results:
After both sites recover and services restart, all buckets (testbucket1–5) should sync to 40M objects each on both zones.

Additional info:
1. Output of the RGW sync log errors (shared below).
2. sync error list output from the secondary zone (shared below).
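As a reference for triage, here is a minimal sketch of the commands that can be used to compare per-bucket progress between the two zones. This is illustrative only: it assumes the standard radosgw-admin CLI and the bucket name above, and is not the exact command set captured during this run.

# Compare "num_objects" for the stuck bucket (run on each zone):
radosgw-admin bucket stats --bucket=testbucket3

# Overall multisite sync state, run on the secondary zone:
radosgw-admin sync status

# Per-bucket sync state for the stuck bucket, run on the secondary zone:
radosgw-admin bucket sync status --bucket=testbucket3

# Recurring sync errors (output shared below):
radosgw-admin sync error list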
sync RGW logs (having debug-rgw and debug-ms set)
==================================================
status.7262ac31-8a55-4607-bb52-cecaea67d13a:testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2 [call out=48b,read 0~156 out=156b] v0'0 uv2899154 ondisk = 0) ==== 304+0+204 (crc 0 0 0) 0x55f06e180280 con 0x55f0568ef000
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f05e470800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648a00:op=0x55f0598c2000:15RGWSyncBucketCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1]: sync status for source bucket: incremental. lease is: not taken. stop indications is: 0
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1]: ERROR: requested sync of future generation 2 > 1, returning -11 for later retry
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648a00:op=0x55f0598c2000:15RGWSyncBucketCR: operate() returned r=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 15 stack 0x55f066648a00 end
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: stack->operate() returned ret=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 20 run: stack=0x55f066648a00 is done
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f05e470800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648b40:op=0x55f060dc6100:25RGWRunBucketSourcesSyncCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 collect(): s=0x55f066648b40 stack=0x55f066648a00 encountered error (r=-11), skipping next stacks
2025-12-17T08:34:02.198+0000 7f9550daf640 10 collect() returned ret=-11
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:1[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):1:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]: a sync operation returned error: -11
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f06e74a000:op=0x55f0597bd000:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f07b79da40:op=0x55f06a8b5800:20RGWSimpleRadosReadCRI22rgw_bucket_sync_statusE: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 rgw rados thread: cr:s=0x55f066648640:op=0x55f06bdbc000:15RGWSyncBucketCR: operate()
2025-12-17T08:34:02.198+0000 7f9550daf640 20 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):257:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257]: sync status for source bucket: incremental. lease is: not taken. stop indications is: 0
2025-12-17T08:34:02.198+0000 7f9550daf640 10 RGW-SYNC:data:sync:shard[91]:entry[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257[2]]:bucket_sync_sources[source=:testbucket3[7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2]):257:source_zone=7262ac31-8a55-4607-bb52-cecaea67d13a]:bucket[testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2<-testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:257]: ERROR: requested sync of future generation 2 > 1, returning -11 for later retry
2025-12-17T08:34:02.198+0000 7f9672ff3640 1 -- 10.64.1.24:0/61706117 <== osd.7 v2:10.64.1.33:6808/1971910789 4343686 ==== osd_op_reply(108630640 bucket.full-sync-status.7262ac31-8a55-4607-bb52-cecaea67d13a:testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2 [call out=48b,read 0~156 out=156b] v0'0 uv2899154 ondisk = 0) ==== 304+0+204 (crc 0 0 0) 0x55f06e180280 con 0x55f0568ef000

"ceph-client.rgw.india.sync.ceph24.yqckqc.log" 274509L, 52188461B

radosgw-admin sync error list output
====================================
[root@ceph24 cede6b96-c809-11f0-8018-04320148dc30]# radosgw-admin sync error list
[
    {
        "shard_id": 0,
        "entries": [
            {
                "id": "1_1765555168.282990_1131705.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:83[1]",
                "timestamp": "2025-12-12T15:59:28.282990Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },
            {
                "id": "1_1765557329.882086_1134561.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:83[1]",
                "timestamp": "2025-12-12T16:35:29.882086Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },
            {
                "id": "1_1765557844.271406_1135397.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:85/priobject1669629",
                "timestamp": "2025-12-12T16:44:04.271406Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 2200,
                    "message": "failed to sync object(2200) Unknown error 2200"
                }
            },
            {
                "id": "1_1765558046.282638_1135683.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:84/priobject15602675",
                "timestamp": "2025-12-12T16:47:26.282638Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 2200,
                    "message": "failed to sync object(2200) Unknown error 2200"
                }
            },
            {
                "id": "1_1765561407.004026_1140490.1",
                "section": "data",
                "name": "testbucket3:7262ac31-8a55-4607-bb52-cecaea67d13a.354854.2:47[1]",
                "timestamp": "2025-12-12T17:43:27.004026Z",
                "info": {
                    "source_zone": "7262ac31-8a55-4607-bb52-cecaea67d13a",
                    "error_code": 22,
                    "message": "failed to sync bucket instance: (22) Invalid argument"
                }
            },
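The repeated "ERROR: requested sync of future generation 2 > 1, returning -11 for later retry" entries above suggest that the data-log entries for testbucket3 reference bucket index log generation 2 while the secondary zone's per-bucket sync status is still at generation 1, so each retry returns -11 (EAGAIN) and the bucket never advances past ~29M objects. A minimal, hedged sketch of commands for inspecting this state (subcommand and flag availability can vary by release; <primary-zone> is a placeholder for the primary zone name):

# Bucket index layout and log generations as seen by each zone (run on both zones and compare):
radosgw-admin bucket layout --bucket=testbucket3

# Per-source sync state of the bucket on the secondary zone:
radosgw-admin bucket sync status --bucket=testbucket3 --source-zone=<primary-zone>

# Data sync state relative to the primary zone:
radosgw-admin data sync status --source-zone=<primary-zone>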