Bug 1327569

Summary: RGW multisite: Need to frequently restart radosgw to trigger sync
Product: Red Hat Ceph Storage
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: shilpa <smanjara>
Assignee: Casey Bodley <cbodley>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: cbodley, ceph-eng-bugs, gmeno, hnallurv, hyelloji, kbader, kdreyer, mbenjamin, owasserm, sweil
Target Milestone: rc
Target Release: 2.0
Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-05-26 19:06:22 UTC

Description shilpa 2016-04-15 12:15:17 UTC
Description of problem:
Active-active multisite configuration. The test sample set includes Swift and S3 buckets and objects, with file sizes ranging from a few kilobytes to a couple of gigabytes. The sync fails more often than not; the radosgw service has to be restarted, after which the sync succeeds.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a master zone and a secondary zone. Create buckets and containers on the master zone, then upload a couple of small files from one of the zones.
2. Check if the files have synced to the peer zone. 
3. Check the sync status on the destination zone and look for errors in rgw logs. 
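
The check in step 2 can be scripted as a diff of the two zones' container listings. This is a minimal sketch, not part of the original report: the helper only compares two pre-captured listing files (file names are placeholders), while the swift invocations in the comments are the ones shown under "Additional info" below.

```shell
#!/bin/sh
# Sketch for step 2: find containers that exist on the source zone but are
# missing on the destination zone. Assumes both listings were captured and
# sorted beforehand; file names here are placeholders.

# Print lines present in the source listing ($1) but absent from the
# destination listing ($2). Both files must be sorted (comm requires it).
missing_on_dest() {
    comm -23 "$1" "$2"
}

# Intended usage (endpoints and credentials as shown in "Additional info"):
#   swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K "$KEY" list | sort > src.txt
#   swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K "$KEY" list | sort > dst.txt
#   missing_on_dest src.txt dst.txt
```

An empty output from `missing_on_dest` means every container on the source is visible on the destination; any printed name (such as test-container below) is one that has not synced.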

Actual results:
Some files fail to sync, and from that point on further object operations do not sync to the peer either. After a radosgw restart, they succeed.


Expected results:
The sync should not fail; having to restart radosgw every time is not expected.

Additional info:

One of the containers that was not synced was test-container.

source rgw1 node:

# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
test-container

Destination rgw2 node:

# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
#

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                oldest incremental change not applied: 2016-04-15 09:30:31.0.670191s
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2016-04-15 10:06:35.0.959384s


After a radosgw restart on rgw2 node, the test-container is created.

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0

          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
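
The "is behind on N shards" lines are what distinguish the stuck state from the caught-up state after the restart. The following is a small sketch (not from the report) that counts those lines; it reads the status text from stdin, which is an assumption made so the helper itself can run without a cluster, with the real radosgw-admin invocation shown only in a comment.

```shell
#!/bin/sh
# Sketch: count "... is behind on N shards" lines in "radosgw-admin sync
# status" output read from stdin. Prints the count and succeeds (exit 0)
# only when no shard is reported behind.
sync_behind() {
    behind=$(grep -c "is behind on")
    echo "$behind"
    [ "$behind" -eq 0 ]
}

# Intended usage on the secondary zone, matching the outputs above:
#   radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0 | sync_behind
```

Against the first status output above this would report 2 (one metadata line, one data line behind); against the post-restart output it would report 0 and succeed.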

Comment 2 Casey Bodley 2016-04-20 15:15:10 UTC
There is a fix in upstream master that allows the gateway to recover from a sync error (https://github.com/ceph/ceph/pull/8190). We believe this is why the restarts were needed.

Comment 3 Christina Meno 2016-04-21 15:15:45 UTC
Appears to be merged upstream since this bug was filed. I'm going to move it to ON_QA as I believe it is resolved.

Comment 4 Harish NV Rao 2016-04-22 12:38:52 UTC
Hi Gregory,

Which downstream build has this fix? Please let us know.

-Harish

Comment 5 Christina Meno 2016-04-25 22:42:15 UTC
*** Bug 1327955 has been marked as a duplicate of this bug. ***

Comment 7 shilpa 2016-05-24 06:02:37 UTC
This still fails. See BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1327142

Comment 8 Casey Bodley 2016-05-26 19:06:22 UTC

*** This bug has been marked as a duplicate of bug 1327142 ***