Bug 1358129 - Multisite: some object sync operations were skipped
Summary: Multisite: some object sync operations were skipped
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.0
Assignee: Casey Bodley
QA Contact: shilpa
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-20 06:41 UTC by shilpa
Modified: 2017-07-31 14:15 UTC
CC List: 10 users

Fixed In Version: RHEL: ceph-10.2.2-26.el7cp Ubuntu: ceph_10.2.2-20redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:44:48 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Ceph Project Bug Tracker | 16742 | 0 | None | None | None | 2016-07-20 06:41:23 UTC
Red Hat Product Errata | RHBA-2016:1755 | 0 | normal | SHIPPED_LIVE | Red Hat Ceph Storage 2.0 bug fix and enhancement update | 2016-08-23 23:23:52 UTC

Description shilpa 2016-07-20 06:41:24 UTC
Description of problem:
Uploaded multipart objects on both zones. A few objects were skipped from being synced to one of the zones, and no retry was attempted.
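
For reference, the kind of multipart upload that triggered this can be scripted with boto 2 (the access logs below show Boto/2.41.0 Python/2.7.5). This is only an illustrative sketch, not the exact test: credentials, part size, and the local file are placeholders, and the endpoint is simply one of the gateways mentioned in this report.

import math
import os

import boto
import boto.s3.connection

# Placeholder credentials; point the connection at one of the zone gateways.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='magna115', port=80, is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('bucket3')
path = 'big.txt'                      # any sufficiently large local file
part_size = 50 * 1024 * 1024          # arbitrary 50 MiB parts
total = os.path.getsize(path)

mp = bucket.initiate_multipart_upload('big.txt')
try:
    with open(path, 'rb') as f:
        parts = int(math.ceil(total / float(part_size)))
        for i in range(parts):
            mp.upload_part_from_file(f, part_num=i + 1,
                                     size=min(part_size, total - i * part_size))
    mp.complete_upload()
except Exception:
    mp.cancel_upload()
    raise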

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-23.el7cp.x86_64

How reproducible:
Observed so far only on the new build ceph-radosgw-10.2.2-23.el7cp.x86_64.


Actual results:

The initial multipart upload of bucket3/big.txt started on magna115 here:

2016-07-19 10:07:35.218618 7f0acbfff700  1 ====== starting new request req=0x7f0acbff9710 =====
2016-07-19 10:07:35.218629 7f0acbfff700  2 req 3236:0.000011::PUT /bucket3/big.txt::initializing for trans_id = tx000000000000000000ca4-00578dfbe7-5e46-us-1

and finished here:

2016-07-19 10:08:17.049985 7f0ac47f0700  1 ====== req done req=0x7f0ac47ea710 op status=0 http_status=200 ======
2016-07-19 10:08:17.050015 7f0ac47f0700  1 civetweb: 0x7f0b780009b0: 10.8.128.74 - - [19/Jul/2016:10:08:16 +0000] "PUT /bucket3/big.txt HTTP/1.1" 200 0 - Boto/2.41.0 Python/2.7.5 Linux/3.10.0-327.el7.x86_64


magna059 sees it in the sync log:

2016-07-19 10:08:41.390609 7f33daffd700 20 bucket sync single entry (source_zone=f5717851-2682-475a-b24b-7bcdec728cbe) b=bucket3:f5717851-2682-475a-b24b-7bcdec728cbe.14122.41/big.txt[0] log_entry=00000000204.11489.3 op=0 op_state=1
2016-07-19 10:08:41.390699 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f335463bf40:26RGWBucketSyncSingleEntryCRISs11rgw_obj_keyE: operate()
2016-07-19 10:08:41.390706 7f33daffd700  5 bucket sync: sync obj: f5717851-2682-475a-b24b-7bcdec728cbe/bucket3(@{i=us-2.rgw.buckets.index,e=us-1.rgw.buckets.non-ec}us-2.rgw.buckets.data[f5717851-2682-475a-b24b-7bcdec728cbe.14122.41])/big.txt[0]
2016-07-19 10:08:41.390711 7f33daffd700  5 Sync:f5717851:data:Object:bucket3:f5717851-2682-475a-b24b-7bcdec728cbe.14122.41/big.txt[0]:fetch
...
2016-07-19 10:08:41.397525 7f33f47e8700 20 sending request to http://magna115:80/bucket3/big.txt?rgwx-zonegroup=0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95&rgwx-prepend-metadata=0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95
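
(For context, the "sending request" line above is data sync fetching the object from the source zone with an ordinary S3 GET plus rgwx-* query parameters; the values below are taken verbatim from that log line. The snippet only illustrates the shape of the request and is not how the sync client actually runs: the real client authenticates as the multisite system user and interprets the prepended metadata.)

import requests

# Illustration only: replays the URL from the log line above, unauthenticated.
params = {
    'rgwx-zonegroup': '0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95',
    'rgwx-prepend-metadata': '0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95',
}
resp = requests.get('http://magna115:80/bucket3/big.txt',
                    params=params, stream=True, timeout=60)
print(resp.status_code)   # in this bug the source zone eventually answered 500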


magna115 sees GET request from magna059:

2016-07-19 10:08:37.179355 7f0b31ffb700  2 req 3332:0.001339:s3:GET /bucket3/big.txt:get_obj:executing

... (almost 50 minutes later) ...

2016-07-19 10:57:59.221172 7f0b31ffb700  0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2016-07-19 10:57:59.221193 7f0b31ffb700 20 get_obj_data::cancel_all_io()
2016-07-19 10:57:59.221629 7f0b31ffb700  0 WARNING: set_req_state_err err_no=5 resorting to 500
2016-07-19 10:57:59.221737 7f0b31ffb700  2 req 3332:2962.043722:s3:GET /bucket3/big.txt:get_obj:completing
2016-07-19 10:57:59.221748 7f0b31ffb700  2 req 3332:2962.043733:s3:GET /bucket3/big.txt:get_obj:op status=-5
2016-07-19 10:57:59.221752 7f0b31ffb700  2 req 3332:2962.043737:s3:GET /bucket3/big.txt:get_obj:http status=500
2016-07-19 10:57:59.221759 7f0b31ffb700  1 ====== req done req=0x7f0b31ff5710 op status=-5 http_status=500 ======
2016-07-19 10:57:59.221786 7f0b31ffb700 20 process_request() returned -5


magna059 gets the error reply (-5, i.e. EIO):

2016-07-19 10:59:52.175208 7f33f47e8700  0 store->fetch_remote_obj() returned r=-5
2016-07-19 10:59:52.175604 7f33f47e8700  1 heartbeat_map reset_timeout 'RGWAsyncRadosProcessor::m_tp thread 0x7f33f47e8700' had timed out after 600
...
2016-07-19 10:59:52.177590 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f3354473800:19RGWFetchRemoteObjCR: operate()
2016-07-19 10:59:52.177596 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f3354473800:19RGWFetchRemoteObjCR: operate() returned r=-5
...
2016-07-19 10:59:52.178023 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f335463bf40:26RGWBucketSyncSingleEntryCRISs11rgw_obj_keyE: operate()
2016-07-19 10:59:52.178028 7f33daffd700  5 Sync:f5717851:data:Object:bucket3:f5717851-2682-475a-b24b-7bcdec728cbe.14122.41/big.txt[0]:done, retcode=-5
2016-07-19 10:59:52.178031 7f33daffd700  0 ERROR: failed to sync object: bucket3:f5717851-2682-475a-b24b-7bcdec728cbe.14122.41/big.txt
...
2016-07-19 11:00:05.774443 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f335463bf40:26RGWBucketSyncSingleEntryCRISs11rgw_obj_keyE: operate()
2016-07-19 11:00:05.774445 7f33daffd700 20 cr:s=0x7f3354197230:op=0x7f335463bf40:26RGWBucketSyncSingleEntryCRISs11rgw_obj_keyE: operate() returned r=-5
2016-07-19 11:00:05.774448 7f33daffd700  5 Sync:f5717851:data:Object:bucket3:f5717851-2682-475a-b24b-7bcdec728cbe.14122.41/big.txt[0]:finish
2016-07-19 11:00:05.774451 7f33daffd700 20 stack->operate() returned ret=-5

Despite the error, we still update the incremental bucket sync position, so the failed object is never retried.
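
The actual sync logic is C++ (RGWBucketSyncSingleEntryCR and friends); the Python pseudocode below, with invented names, is only a sketch of the failure mode described above: if the sync position is advanced unconditionally, a -5 (EIO) from the remote fetch gets logged but the entry is never retried. Guarding the marker update on the per-entry return code is what preserves the retry.

# Illustrative pseudocode only -- not the RGW implementation; names are invented.
def sync_bucket_entries(entries, fetch_remote_obj, marker_store):
    """Apply bucket-index log entries in order on the destination zone."""
    for entry in entries:
        ret = fetch_remote_obj(entry.obj)   # returned -5 (EIO) for big.txt in the log
        if ret < 0:
            print('ERROR: failed to sync object: %s (r=%d)' % (entry.obj, ret))
            # Stop here so the marker is NOT moved past the failed entry;
            # the next incremental sync pass will pick it up again.
            return ret
        # Only entries that were applied successfully advance the sync position.
        marker_store.update(entry.marker)
    return 0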

Comment 6 John Poelstra 2016-07-20 15:47:57 UTC
Will get pulled into next build

Comment 11 shilpa 2016-08-01 12:28:32 UTC
The issue has not been seen since ceph-10.2.2-26. Moving to VERIFIED.

Comment 13 errata-xmlrpc 2016-08-23 19:44:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

