Bug 1327569

Summary: RGW multisite: Need to frequently restart radosgw to trigger sync
Product: Red Hat Ceph Storage
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: shilpa <smanjara>
Assignee: Casey Bodley <cbodley>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: cbodley, ceph-eng-bugs, gmeno, hnallurv, hyelloji, kbader, kdreyer, mbenjamin, owasserm, sweil
Target Milestone: rc
Target Release: 2.0
Fixed In Version: ceph-10.2.0-1.el7cp
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-05-26 19:06:22 UTC

Description shilpa 2016-04-15 12:15:17 UTC
Description of problem:
Active-active multisite configuration. The test sample set includes Swift and S3 buckets and objects, with file sizes ranging from a few kilobytes to a couple of gigabytes. The sync fails more often than not; the radosgw service has to be restarted, after which the sync succeeds.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a master zone and a secondary zone. Create buckets and containers on the master zone, then upload a couple of small files from one of the zones.
2. Check if the files have synced to the peer zone. 
3. Check the sync status on the destination zone and look for errors in rgw logs. 
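
The check in step 2 can be scripted as a diff of the two zones' container listings. This is a minimal sketch, not part of the original report: the helper only compares two pre-captured listing files (file names are placeholders), while the swift invocations in the comments are the ones shown under "Additional info" below.

```shell
#!/bin/sh
# Sketch for step 2: find containers that exist on the source zone but are
# missing on the destination zone. Assumes both listings were captured and
# sorted beforehand; file names here are placeholders.

# Print lines present in the source listing ($1) but absent from the
# destination listing ($2). Both files must be sorted (comm requires it).
missing_on_dest() {
    comm -23 "$1" "$2"
}

# Intended usage (endpoints and credentials as shown in "Additional info"):
#   swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K "$KEY" list | sort > src.txt
#   swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K "$KEY" list | sort > dst.txt
#   missing_on_dest src.txt dst.txt
```

An empty output from `missing_on_dest` means every container on the source is visible on the destination; any printed name (such as test-container below) is one that has not synced.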

Actual results:
Some files fail to sync, and from that point on further object operations do not sync to the peer either. After a radosgw restart, they succeed.


Expected results:
The sync should not fail; having to restart radosgw every time is not expected.

Additional info:

One of the containers that was not synced was test-container.

source rgw1 node:

# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
test-container

Destination rgw2 node:

# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list | grep test-container
#

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                oldest incremental change not applied: 2016-04-15 09:30:31.0.670191s
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2016-04-15 10:06:35.0.959384s


After a radosgw restart on rgw2 node, the test-container is created.

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0

          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
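
The "is behind on N shards" lines are what distinguish the stuck state from the caught-up state after the restart. The following is a small sketch (not from the report) that counts those lines; it reads the status text from stdin, which is an assumption made so the helper itself can run without a cluster, with the real radosgw-admin invocation shown only in a comment.

```shell
#!/bin/sh
# Sketch: count "... is behind on N shards" lines in "radosgw-admin sync
# status" output read from stdin. Prints the count and succeeds (exit 0)
# only when no shard is reported behind.
sync_behind() {
    behind=$(grep -c "is behind on")
    echo "$behind"
    [ "$behind" -eq 0 ]
}

# Intended usage on the secondary zone, matching the outputs above:
#   radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0 | sync_behind
```

Against the first status output above this would report 2 (one metadata line, one data line behind); against the post-restart output it would report 0 and succeed.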

Comment 2 Casey Bodley 2016-04-20 15:15:10 UTC
There is a fix in upstream master that allows the gateway to recover from a sync error (https://github.com/ceph/ceph/pull/8190). We believe this is why the restarts were needed.

Comment 3 Christina Meno 2016-04-21 15:15:45 UTC
Appears to be merged upstream since this bug was filed. I'm going to move it to ON_QA as I believe it is resolved.

Comment 4 Harish NV Rao 2016-04-22 12:38:52 UTC
Hi Gregory,

Which downstream build has this fix? Please let us know.

-Harish

Comment 5 Christina Meno 2016-04-25 22:42:15 UTC
*** Bug 1327955 has been marked as a duplicate of this bug. ***

Comment 7 shilpa 2016-05-24 06:02:37 UTC
This still fails. See BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1327142

Comment 8 Casey Bodley 2016-05-26 19:06:22 UTC

*** This bug has been marked as a duplicate of bug 1327142 ***