Description of problem:
In a multisite environment where the master and non-master zones were swapped, a few objects were uploaded from both zones. The master zone received and synced all of the uploaded objects, but the non-master zone stopped syncing objects. A later attempt to create a bucket on the non-master zone also failed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a multisite configuration with rgw1 as master and rgw2 as non-master, then upload objects and let them sync.
2. Switch rgw2 to be the master zone, run a period update with commit, and restart the gateways.
3. Upload objects and check sync status.
All the objects synced to the new master zone, rgw2. The first few objects also synced to rgw1, but all subsequent object uploads and syncs failed with:
('Connection aborted.', error(110, 'Connection timed out')
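For anyone retracing steps 2 and 3, the sequence might look roughly like the sketch below. It is not the reporter's actual command history (that was never posted). The zone name `us-2` is taken from the comments in this report, while the gateway instance name `rgw2-host` and the systemd unit pattern are assumptions that should be adjusted for the deployment. The helper only prints the commands unless `dry_run=False` is passed.

```python
import subprocess


def failover_commands(new_master_zone, rgw_instance):
    """Build the argv lists for promoting a zone to master and checking sync.

    rgw_instance is a hypothetical gateway instance name; the systemd unit
    pattern below is an assumption about how the gateway was deployed.
    """
    return [
        # Step 2a: mark the zone as master and default.
        ["radosgw-admin", "zone", "modify",
         f"--rgw-zone={new_master_zone}", "--master", "--default"],
        # Step 2b: commit the new period so the change propagates.
        ["radosgw-admin", "period", "update", "--commit"],
        # Step 2c: restart the gateway so it picks up the new period.
        ["systemctl", "restart", f"ceph-radosgw@rgw.{rgw_instance}"],
        # Step 3: check sync status after uploading objects.
        ["radosgw-admin", "sync", "status"],
    ]


def run_failover(new_master_zone, rgw_instance, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in failover_commands(new_master_zone, rgw_instance):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without touching the cluster.
    run_failover("us-2", "rgw2-host")
```

Capturing the printed commands and their output alongside the gateway logs would make it easier to confirm whether the period commit actually took effect.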
It looks like magna115 was configured as the initial master zone us-1, correct? I'm not seeing any evidence in the logs that it was ever restarted:
$ grep -n --binary-files=text 'ceph version' ceph-rgw-magna115.log-20160716
2:2016-07-15 07:14:33.982866 7f78170c59c0 0 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6), process radosgw, pid 26186
I also searched for the period commit and found two instances of the 'post_period' request:
$ grep --binary-files=text 'post_period:http' ceph-rgw-magna115.log-20160716
2016-07-15 07:14:40.563546 7f762d7da700 2 req 97:1.024208::POST /admin/realm/period:post_period:http status=200
2016-07-15 09:48:01.171323 7f7634fe9700 2 req 87458:1.006765::POST /admin/realm/period:post_period:http status=200
Both of those included the message "period epoch 1 is not newer than current epoch 1, discarding update", so the period configuration doesn't appear to have changed since the gateway started at 07:14:33.
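The same checks can be scripted when several gateway logs need to be compared. The sketch below is a rough equivalent of the two grep commands plus a search for the "discarding update" message. The function name and the returned dictionary layout are made up for illustration, and it assumes each log line begins with a date and a time field, as in the excerpts above.

```python
import re

# Matches the message quoted in this report and captures both epochs.
DISCARD_RE = re.compile(
    r"period epoch (\d+) is not newer than current epoch (\d+), discarding update"
)


def summarize_rgw_log(lines):
    """Collect gateway starts, period POSTs, and discarded period updates."""
    summary = {"starts": [], "period_posts": [], "discarded": []}
    for line in lines:
        # Assumption: the first two whitespace-separated fields are date and time.
        timestamp = " ".join(line.split()[:2])
        if "ceph version" in line:
            summary["starts"].append(timestamp)
        if "post_period:http" in line:
            summary["period_posts"].append(timestamp)
        match = DISCARD_RE.search(line)
        if match:
            # Store (incoming epoch, current epoch) as integers.
            summary["discarded"].append((int(match.group(1)), int(match.group(2))))
    return summary


def summarize_rgw_log_file(path):
    """Read a log that may contain binary bytes, like grep --binary-files=text."""
    with open(path, "r", errors="replace") as handle:
        return summarize_rgw_log(handle)
```

A single entry in `starts` together with a non-empty `discarded` list would match the situation described here: the gateway was never restarted, and each period POST was dropped because its epoch was not newer than the current one.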
Are the logs for zone us-2 available anywhere? Is there any way that your 'zone modify' and 'period update --commit' commands are still in scrollback, so you could copy/paste their output?
Due to a bug in how we update the sync status markers, we were skipping past bucket entries that hadn't completed.
Yehuda's PR at https://github.com/ceph/ceph/pull/10355 should fix this.
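To illustrate the class of bug described above, here is a toy model of a sync marker tracker. It is not the code from the PR and does not mirror Ceph's data structures; the class name, the integer positions, and the assumption that entries are started in increasing order are all invented for the example. The point is only that a persisted marker must never move past an entry that is still in flight.

```python
class MarkerTracker:
    """Toy sync marker: every position <= marker is known to be complete."""

    def __init__(self):
        self.pending = set()    # entries started but not yet finished
        self.finished = set()   # entries finished but not yet covered by marker
        self.marker = 0         # highest position safe to resume after

    def start(self, pos):
        """Record that processing of entry pos has begun."""
        self.pending.add(pos)

    def finish(self, pos):
        """Record completion of pos and return the (possibly advanced) marker."""
        self.pending.discard(pos)
        self.finished.add(pos)
        return self._advance()

    def _advance(self):
        # The marker may only cover finished entries that sit below every
        # entry still in flight; otherwise a restart would skip pending work.
        limit = min(self.pending) if self.pending else None
        done = {p for p in self.finished if limit is None or p < limit}
        if done:
            self.marker = max(self.marker, max(done))
            self.finished -= done
        return self.marker


def naive_marker(finished_positions):
    """The faulty pattern: jump to the highest finished entry, ignoring gaps."""
    return max(finished_positions, default=0)
```

With entries 1, 2 and 3 in flight, finishing entry 3 first leaves the tracker's marker at 0, whereas `naive_marker([3])` returns 3. If sync were interrupted at that point and resumed from the naive marker, entries 1 and 2 would be skipped.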
Tested on ceph-10.2.2-26.el7cp along with rgw_thread_pool_size=200. I don't see the issue anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.