Bug 1362639

Summary: Object sync requests skipped in some scenarios
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: shilpa <smanjara>
Component: RGW Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED WONTFIX QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium Docs Contact: Bara Ancincova <bancinco>
Priority: unspecified    
Version: 2.0 CC: anharris, bancinco, cbodley, ceph-eng-bugs, hnallurv, kbader, kdreyer, mbenjamin, owasserm, smanjara, sweil, yehuda
Target Milestone: rc   
Target Release: 3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
.Object sync requests are sometimes skipped
In multi-site configurations of the Ceph Object Gateway, a non-master zone can be promoted to the master zone. In most cases, the master zone's gateway or gateways are still running when this happens. However, if the gateways are down, it can take up to 30 seconds after their restart until the gateways notice that another zone was promoted. During this time, the gateways can miss changes to buckets that occur on other zones. Consequently, object sync requests are skipped.

To work around this issue, pull the new master's period to the old master zone before restarting the old master zone:

----
$ radosgw-admin period pull --remote=<new-master-zone-id>
----

For details on pulling the period, see the https://access.redhat.com/documentation/en/red-hat-ceph-storage/2/single/object-gateway-guide-for-red-hat-enterprise-linux[Ceph Object Gateway Guide for Red Hat Enterprise Linux] or the https://access.redhat.com/documentation/en/red-hat-ceph-storage/2/single/object-gateway-guide-for-ubuntu[Ceph Object Gateway Guide for Ubuntu].
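The ordering in the workaround (pull the period first, bring the gateway back second) can be sketched as a small Python helper. It only builds the commands and runs nothing. The `radosgw-admin period pull --remote=...` invocation comes from the doc text above; the function name, the use of `systemctl`, and the `ceph-radosgw@rgw.<instance>` unit name are assumptions for illustration and may differ on a given deployment.

```python
from typing import List


def old_master_recovery_plan(new_master_zone_id: str, rgw_instance: str) -> List[List[str]]:
    """Return the commands to run on the old master zone, in order.

    Hypothetical helper: it builds argument lists only and executes nothing.
    """
    if not new_master_zone_id or not rgw_instance:
        raise ValueError("zone id and gateway instance name are required")

    # Step 1 (from the known-issue text): pull the new master's period first.
    pull_period = ["radosgw-admin", "period", "pull", f"--remote={new_master_zone_id}"]

    # Step 2: only then bring the old master's gateway back up.
    # Assumption: the gateway runs as a systemd unit with this name.
    start_gateway = ["systemctl", "start", f"ceph-radosgw@rgw.{rgw_instance}"]

    return [pull_period, start_gateway]


if __name__ == "__main__":
    # Placeholders are left unfilled on purpose; substitute real values.
    for command in old_master_recovery_plan("<new-master-zone-id>", "<instance>"):
        print(" ".join(command))
```

Keeping the pull ahead of the start is the point of the sketch: the gateway then comes up already holding the new period instead of discovering the promotion some time after it starts.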
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-07-06 14:37:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1322504, 1383917, 1412948    

Description shilpa 2016-08-02 18:02:31 UTC
Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-32.el7cp.x86_64

Steps to Reproduce:
1. Stop the rgw process on the master zone.
2. Switch a non-master zone to master.
3. Create new buckets and objects on the current master.
4. Bring up the rgw process on the old master zone.
5. Create more buckets and upload more objects on the current master.
6. Check for sync.
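The reproduction steps above can be written down as an ordered plan, sketched here in Python. The report does not state which commands were used, so every command string is an assumption: `radosgw-admin zone modify --master --default`, `radosgw-admin period update --commit`, and `radosgw-admin sync status` are the usual multi-site commands, and the `ceph-radosgw.target` systemd unit is a guess at how the gateway was stopped and started. Steps where the report names no tool carry no command.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    where: str               # zone on which the step is carried out
    action: str              # what the step does, in words
    command: Optional[str]   # illustrative command, or None if no tool is named


def reproduction_steps(old_master: str, new_master: str) -> List[Step]:
    """Hypothetical ordered plan for reproducing the skipped object sync."""
    return [
        # Assumption: the gateway is managed through this systemd target.
        Step(old_master, "stop the rgw process", "systemctl stop ceph-radosgw.target"),
        Step(new_master, "promote the zone to master",
             f"radosgw-admin zone modify --rgw-zone={new_master} --master --default"),
        Step(new_master, "commit the new period", "radosgw-admin period update --commit"),
        # The report does not say which client created the buckets and objects.
        Step(new_master, "create new buckets and objects", None),
        Step(old_master, "bring the rgw process back up", "systemctl start ceph-radosgw.target"),
        Step(new_master, "create more buckets and objects", None),
        Step(old_master, "check sync", "radosgw-admin sync status"),
    ]


if __name__ == "__main__":
    for number, step in enumerate(reproduction_steps("<old-master>", "<new-master>"), start=1):
        shown = step.command if step.command is not None else "(no command given in the report)"
        print(f"{number}. [{step.where}] {step.action}: {shown}")
```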

Actual results:
All the buckets were synced. Objects created before the rgw process was restarted failed to sync, but objects created afterwards synced successfully.

I don't see a GET request sent to master on the buckets that are missing objects.

Comment 10 Casey Bodley 2016-08-11 14:01:57 UTC
Thanks Bara,

The only part that I'd clarify is "During this process, the master zone is down and object sync requests can be skipped under certain circumstances."

I'd suggest replacing that sentence with:
Generally, the master zone's gateway(s) will still be running when this happens. But in the case where its gateways are all down, it can take up to 30 seconds after restarting for them to notice that another zone was promoted. During this window, they can miss some changes to buckets that occur on other zones.

Comment 12 Casey Bodley 2016-08-11 14:25:52 UTC
Looks good!

Comment 13 shilpa 2016-08-17 09:27:38 UTC
Hi Bara,

Is this yet to be documented in the release notes?

Comment 17 Matt Benjamin (redhat) 2016-10-03 17:57:03 UTC
The behavior is expected in current multisite sync. In future releases, multisite sync may support more complex failover and recovery scenarios, but there is no upstream tracker issue yet.