Bug 1327569 - RGW multisite: Need to frequently restart radosgw to trigger sync
Summary: RGW multisite: Need to frequently restart radosgw to trigger sync
Keywords:
Status: CLOSED DUPLICATE of bug 1327142
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Casey Bodley
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Duplicates: 1327955
Depends On:
Blocks:
 
Reported: 2016-04-15 12:15 UTC by shilpa
Modified: 2017-07-30 16:03 UTC
CC List: 10 users

Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-26 19:06:22 UTC
Embargoed:



Description shilpa 2016-04-15 12:15:17 UTC
Description of problem:
Active-active multisite configuration. The test sample set includes Swift and S3 buckets and objects, with file sizes ranging from a few kilobytes to a couple of gigabytes. Most of the time the sync fails; the radosgw service has to be restarted, after which the sync succeeds.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a master zone and a secondary zone. Create buckets and containers on the master zone, then upload a couple of small files from one of the zones (see the sketch after these steps).
2. Check whether the files have synced to the peer zone.
3. Check the sync status on the destination zone and look for errors in the rgw logs.
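
A minimal reproduction sketch for the steps above, using the same Swift auth style, zone names, and user shown later in this report; the container name, object file, and the '<swift-key>' placeholder are illustrative, not taken from a specific run.

On the source rgw1 node:

# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K '<swift-key>' post test-container
# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K '<swift-key>' upload test-container ./small-object.bin

On the destination rgw2 node:

# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K '<swift-key>' list | grep test-container
# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0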

Actual results:
Some files fail to sync. From that point on, further object operations also do not sync to the peer zone, but after a radosgw restart they succeed.


Expected results:
The sync should not fail. Having to restart radosgw every time is not expected.

Additional info:

One of the containers that was not synced was test-container.

source rgw1 node:

# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
test-container

Destination rgw2 node:

# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
#

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                oldest incremental change not applied: 2016-04-15 09:30:31.0.670191s
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2016-04-15 10:06:35.0.959384s


After a radosgw restart on the rgw2 node, the test-container is created on the destination.
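
(The restart referred to here is of the gateway daemon on rgw2. On a systemd host running RHCS 2.0, the unit name is typically of the form ceph-radosgw@rgw.<instance>; the instance name below is an assumption, not taken from this report.)

# systemctl restart ceph-radosgw@rgw.rgw2.service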

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0

          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Comment 2 Casey Bodley 2016-04-20 15:15:10 UTC
There is a fix in upstream master that allows the gateway to recover from a sync error (https://github.com/ceph/ceph/pull/8190). We believe this is why the restarts were needed.
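
(For downstream verification, a minimal check is to compare the installed gateway package against the build listed in the Fixed In Version field, ceph-10.2.0-1.el7cp. The command below assumes the RPM-based install shown elsewhere in this report.)

# rpm -q ceph-radosgw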

Comment 3 Christina Meno 2016-04-21 15:15:45 UTC
This appears to have been merged upstream since the bug was filed. I'm going to move it to ON_QA as I believe it is resolved.

Comment 4 Harish NV Rao 2016-04-22 12:38:52 UTC
Hi Gregory,

Which downstream build has this fix? Please let us know.

-Harish

Comment 5 Christina Meno 2016-04-25 22:42:15 UTC
*** Bug 1327955 has been marked as a duplicate of this bug. ***

Comment 7 shilpa 2016-05-24 06:02:37 UTC
This still fails. See BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1327142

Comment 8 Casey Bodley 2016-05-26 19:06:22 UTC

*** This bug has been marked as a duplicate of bug 1327142 ***

