Bug 1420221 - Metadata sync fails after a radosgw restart on the third zone in three-way multisite env
Summary: Metadata sync fails after a radosgw restart on the third zone in three-way multisite env
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 2.2
Assignee: Orit Wasserman
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-08 08:51 UTC by shilpa
Modified: 2017-07-30 16:03 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-15 12:48:50 UTC
Target Upstream Version:



Description shilpa 2017-02-08 08:51:12 UTC
Description of problem:
Stop the radosgw process on a non-master zone in a three-way multisite env. Continue doing I/O operations on the other zones, then start the rgw process again. Newly created buckets fail to sync after the restart.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.5-13.el7cp.x86_64


Steps to Reproduce:
1. Configure three-way multisite clusters with one zone in each site.
2. Create buckets and objects and ensure that all of them have synced.
3. Stop radosgw on a non-master zone.
4. Continue writing to other zones.
5. Start radosgw on that zone again.


Actual results:
Objects written to existing buckets sync to the zone where rgw was restarted, but new bucket create operations issued after the restart fail to sync.


Additional info:

The following errors appear in the output of the radosgw-admin sync status command:

2017-02-08 08:38:11.720841 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=122
2017-02-08 08:38:11.720842 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=123
2017-02-08 08:38:11.720842 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=124
2017-02-08 08:38:11.720843 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=125
2017-02-08 08:38:11.720844 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=126
2017-02-08 08:38:11.720844 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=127
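
A quick way to pull out just the affected shard IDs from saved sync status output (a sketch; the two sample lines below stand in for the real `radosgw-admin sync status` output captured on the node):

```shell
# Save (a sample of) the sync status errors, then extract the shard IDs.
cat > /tmp/sync-status.log <<'EOF'
2017-02-08 08:38:11.720841 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=122
2017-02-08 08:38:11.720844 7f79648d19c0 -1 ERROR: could not find remote sync shard status for shard_id=127
EOF
# grep -o prints only the matching token; cut takes the number after '='.
grep -o 'shard_id=[0-9]*' /tmp/sync-status.log | cut -d= -f2 | sort -n | uniq
```

On the full output this gives the list of shards (122 through 127 here) whose remote sync status could not be found.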

Checking the logs for shard_id=124:
2017-02-08 07:32:49.105363 7f5af3fff700 10 sync: incremental_sync: shard_id=124 r=-22
2017-02-08 07:32:49.105364 7f5af3fff700 20 cr:s=0x7f5ae80e0d30:op=0x7f5ae8befa60:18RGWDataSyncShardCR: operate() returned r=-22
2017-02-08 07:32:49.105366 7f5af3fff700 20 cr:s=0x7f5ae80e0d30:op=0x7f5ae80e03f0:25RGWDataSyncShardControlCR: operate()
2017-02-08 07:32:49.105372 7f5af3fff700  5 Sync:94b94a1a:data:DataShard:datalog.sync-status.shard.94b94a1a-6aa1-4944-9064-a5ae68bf3811.124:finish
2017-02-08 07:32:49.105374 7f5af3fff700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2017-02-08 07:32:49.105378 7f5af3fff700 20 run: stack=0x7f5ae80e0d30 is io blocked
2017-02-08 07:32:49.105608 7f5af3fff700 20 cr:s=0x7f5ae8bf65f0:op=0x7f5ae81de070:21RGWRadosSetOmapKeysCR: operate()
2017-02-08 07:32:49.105615 7f5af3fff700 20 cr:s=0x7f5ae8bf65f0:op=0x7f5ae81de070:21RGWRadosSetOmapKeysCR: operate()
2017-02-08 07:32:49.105617 7f5af3fff700 20 cr:s=0x7f5ae8bf65f0:op=0x7f5ae81de070:21RGWRadosSetOmapKeysCR: operate()
2017-02-08 07:32:49.105618 7f5af3fff700 20 cr:s=0x7f5ae8bf65f0:op=0x7f5ae81de070:21RGWRadosSetOmapKeysCR: operate()
2017-02-08 07:32:49.105622 7f5af3fff700 20 cr:s=0x7f5ae8bf65f0:op=0x7f5ae8bf6870:13RGWOmapAppend: operate()
2017-02-08 07:32:49.105626 7f5af3fff700 20 run: stack=0x7f5ae8bf65f0 is done
2017-02-08 07:32:49.105628 7f5af3fff700 20 cr:s=0x7f5ae80fbdd0:op=0x7f5ae8bf48c0:18RGWDataSyncShardCR: operate()
2017-02-08 07:32:49.105629 7f5af3fff700 20 collect(): s=0x7f5ae80fbdd0 stack=0x7f5ae8bf65f0 is complete
2017-02-08 07:32:49.105630 7f5af3fff700 20 collect(): s=0x7f5ae80fbdd0 stack=0x7f5ae8bf7c30 is still running
2017-02-08 07:32:49.105631 7f5af3fff700 20 run: stack=0x7f5ae80fbdd0 is_blocked_by_stack()=0 is_sleeping=0 waiting_for_child()=1
2017-02-08 07:32:49.105761 7f5af3fff700 20 cr:s=0x7f5ae8bf7c30:op=0x7f5ae83893c0:22RGWSimpleRadosUnlockCR: operate()
2017-02-08 07:32:49.105767 7f5af3fff700 20 cr:s=0x7f5ae8bf7c30:op=0x7f5ae83893c0:22RGWSimpleRadosUnlockCR: operate()
2017-02-08 07:32:49.105769 7f5af3fff700 20 cr:s=0x7f5ae8bf7c30:op=0x7f5ae83893c0:22RGWSimpleRadosUnlockCR: operate()
2017-02-08 07:32:49.105771 7f5af3fff700 20 cr:s=0x7f5ae8bf7c30:op=0x7f5ae83893c0:22RGWSimpleRadosUnlockCR: operate()
2017-02-08 07:32:49.105777 7f5af3fff700 20 cr:s=0x7f5ae8bf7c30:op=0x7f5ae8bf7380:20RGWContinuousLeaseCR: operate()
2017-02-08 07:32:49.105778 7f5af3fff700 20 run: stack=0x7f5ae8bf7c30 is done
2017-02-08 07:32:49.105780 7f5af3fff700 20 cr:s=0x7f5ae80fbdd0:op=0x7f5ae8bf48c0:18RGWDataSyncShardCR: operate()
2017-02-08 07:32:49.105781 7f5af3fff700 20 collect(): s=0x7f5ae80fbdd0 stack=0x7f5ae8bf7c30 is complete
2017-02-08 07:32:49.105783 7f5af3fff700 20 cr:s=0x7f5ae80fbdd0:op=0x7f5ae8bf48c0:18RGWDataSyncShardCR: operate()
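
The repeated r=-22 in the shard log is a negated errno value; a one-liner to confirm which error it maps to (using python3 from the shell):

```shell
# -22 is -EINVAL: the coroutine is failing with "Invalid argument".
python3 -c 'import errno, os; print(errno.errorcode[22], "-", os.strerror(22))'
```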

Not sure if it is related to http://tracker.ceph.com/issues/17569

Comment 2 Orit Wasserman 2017-02-08 14:28:49 UTC
Hi Shilpa,
can you provide the radosgw logs (for all the nodes)?

Thanks

