Description of problem:
Active-active multisite configuration. The test sample set includes Swift and S3 buckets and objects with file sizes ranging from a few kilobytes to a couple of gigabytes. Most often the sync fails. The radosgw service has to be restarted, after which the sync succeeds.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a master zone and a secondary zone. Create buckets and containers on the master zone and upload a couple of small files from one of the zones.
2. Check whether the files have synced to the peer zone.
3. Check the sync status on the destination zone and look for errors in the rgw logs.

Actual results:
Some files fail to sync. At that point any further object operations do not sync to the peer, but after a radosgw restart they succeed.

Expected results:
The sync should not fail. Restarting radosgw every time is not expected.

Additional info:
One of the containers that was not synced was test-container.

Source rgw1 node:
# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
test-container

Destination rgw2 node:
# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
#

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                oldest incremental change not applied: 2016-04-15 09:30:31.0.670191s
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2016-04-15 10:06:35.0.959384s

After a radosgw restart on the rgw2 node, the test-container is created.

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
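For anyone reproducing this, the workaround used above amounts to confirming the stall, restarting the gateway on the lagging zone, and re-checking status. A minimal sketch follows, assuming a systemd-managed gateway on the secondary node with the instance name rgw.rgw2 (the unit name is an assumption; adjust to your deployment, and note that `sync error list` may not be available in every 10.x build):

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
# radosgw-admin sync error list --rgw-zone=us-2
# systemctl restart ceph-radosgw@rgw.rgw2
# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0

Only the gateway on the zone that is behind needed a restart here; once it came back, the stuck shards caught up, matching the "caught up" output shown above.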
There is a fix in upstream master that allows the gateway to recover from a sync error (https://github.com/ceph/ceph/pull/8190). We believe the inability to recover from such errors is why the restarts were needed.
The fix appears to have been merged upstream since this bug was filed. I'm going to move it to ON_QA as I believe it is resolved.
Hi Gregory, which downstream build has this fix? Please let us know. -Harish
*** Bug 1327955 has been marked as a duplicate of this bug. ***
This still fails. See BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1327142
*** This bug has been marked as a duplicate of bug 1327142 ***