Bug 1330952 - radosgw process segfaults when multisite params are changed w/sync in progress
Summary: radosgw process segfaults when multisite params are changed w/sync in progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.2
Assignee: Casey Bodley
QA Contact: shilpa
Docs Contact: Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks: 1322504 1383917 1412948 1437916
Reported: 2016-04-27 11:08 UTC by shilpa
Modified: 2017-07-30 15:39 UTC

Fixed In Version: RHEL: ceph-10.2.5-7.el7cp Ubuntu: ceph_10.2.5-3redhat1xenial
Doc Type: Known Issue
Doc Text:
.Multi-site configuration of the Ceph Object Gateway sometimes fails when options are changed at runtime
When the `rgw md log max shards` and `rgw data log num shards` options are changed at runtime in a multi-site configuration of the Ceph Object Gateway, the `radosgw` process terminates unexpectedly with a segmentation fault. To avoid this issue, do not change these options at runtime; set them during the initial configuration of the Ceph Object Gateway.
Clone Of:
Environment:
Last Closed: 2017-03-14 15:43:45 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0514 normal SHIPPED_LIVE Red Hat Ceph Storage 2.2 bug fix and enhancement update 2017-03-21 07:24:26 UTC
Ceph Project Bug Tracker 17414 None None None 2016-09-27 15:12:33 UTC
Ceph Project Bug Tracker 18488 None None None 2017-01-11 14:58:02 UTC

Description shilpa 2016-04-27 11:08:12 UTC
Description of problem:
On a three-way active-active cluster, radosgw was found crashing continuously on two of the nodes.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.0-1.el7cp.x86_64

How reproducible:
Only once

Steps to Reproduce:
I am not sure what caused the segfault, but the data had synced on all the nodes. I enabled the following options in the config file and restarted the gateways:

 rgw md log max shards = 1
 rgw data log num shards = 1

After restarting the gateway manually, the process started crashing continuously until I disabled the above options. However, I am not able to reproduce this again by enabling the options. 


Actual results:
    0> 2016-04-27 10:35:10.652776 7f84727fc700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f84727fc700 thread_name:radosgw

 ceph version 10.2.0-1.el7cp (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (()+0x54659a) [0x7f8584caa59a]
 2: (()+0xf100) [0x7f85840e4100]
 3: (()+0x1c48d0) [0x7f857a03b8d0]
 4: (()+0x241c50) [0x7f857a0b8c50]
 5: (()+0x21a448) [0x7f857a091448]
 6: (()+0xc3649) [0x7f8579f3a649]
 7: (()+0xd04a6) [0x7f8579f474a6]
 8: (()+0xd3e72) [0x7f8579f4ae72]
 9: (()+0xd418d) [0x7f8579f4b18d]
 10: (()+0xa7700) [0x7f8579f1e700]
 11: (librados::IoCtx::operate(std::string const&, librados::ObjectReadOperation*, ceph::buffer::list*)+0x46) [0x7f8579ed8af6]
 12: (RGWRados::time_log_info(std::string const&, cls_log_header*)+0xd9) [0x7f8584b7b0a9]
 13: (RGWDataChangesLog::get_info(int, RGWDataChangesLogInfo*)+0x75) [0x7f8584a41ba5]
 14: (RGWOp_DATALog_ShardInfo::execute()+0x11b) [0x7f8584bd03cb]
 15: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd07) [0x7f8584b30637]
 16: (()+0x1937a) [0x7f858e5cc37a]
 17: (()+0x2330f) [0x7f858e5d630f]
 18: (()+0x252f8) [0x7f858e5d82f8]
 19: (()+0x7dc5) [0x7f85840dcdc5]
 20: (clone()+0x6d) [0x7f85836e728d]


I do see a lot of these messages occurring:

2016-04-27 10:57:02.551733 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.551751 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.552426 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.557895 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.558080 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
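
For readers decoding the return codes in the messages above: Ceph generally reports failures as negated errno values, so -22 corresponds to EINVAL (invalid argument) on Linux. The following is a minimal sketch for translating such codes from log lines. The helper name `decode_ret` and the regular expression are illustrative assumptions, not part of any Ceph tool.

```python
import errno
import os
import re


def decode_ret(line):
    """Return (code, symbolic name, message) for a log line that
    contains 'ret=-N' or 'returned -N', or None if no code is found.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"(?:ret=|returned )(-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    return code, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


line = "0 ERROR: failed to fetch remote data log info: ret=-22"
print(decode_ret(line))
```

An EINVAL from the remote data log request is at least consistent with a peer asking about a shard layout that the other side does not recognize, though the log alone does not prove that.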


On all the nodes data is synced.

     data sync source: 9e11a3ed-376f-4f14-b493-cdb45fb88310 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: eb432a5e-92fb-480c-9147-f02df4f558e1 (us-3)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
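
The status output above reports 128 shards per source, while the config change set the shard options to 1. A small sketch like the one below could pull the shard totals out of such output so they can be compared with the configured value. It assumes the "N/M shards" line format shown above; the function name `shard_totals` is hypothetical.

```python
import re

# Excerpt of the sync status text from this report.
STATUS = """\
data sync source: 9e11a3ed-376f-4f14-b493-cdb45fb88310 (us-2)
                  syncing
                  full sync: 0/128 shards
                  incremental sync: 128/128 shards
"""


def shard_totals(status_text):
    """Collect the distinct total shard counts from 'N/M shards' lines."""
    pairs = re.findall(r"(\d+)/(\d+) shards", status_text)
    return {int(total) for _done, total in pairs}


configured = 1  # value set in the config file in this report
totals = shard_totals(STATUS)
if totals != {configured}:
    print(f"shard count mismatch: status reports {sorted(totals)}, config says {configured}")
```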

Comment 3 Casey Bodley 2016-05-06 21:47:49 UTC
I imagine that this is caused by restarting the gateway with a different value of rgw_data_log_num_shards. The sync status is stored per shard, so the gateway should recognize this and restart a full sync. This hasn't been tested, though.

If this segfault becomes a blocker for you, make sure that these config variables are set during your initial cluster setup so they don't change at runtime.
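
The behavior described in this comment (recognize a changed shard count and restart a full sync) can be sketched roughly as follows. This is not the radosgw implementation: `SyncStatus`, its fields, and `reconcile` are hypothetical names chosen to illustrate the idea that per-shard markers recorded under one shard count cannot be reused under another.

```python
from dataclasses import dataclass, field


@dataclass
class SyncStatus:
    """Hypothetical per-source sync state: one marker per log shard."""
    num_shards: int
    markers: dict = field(default_factory=dict)  # shard id -> position marker
    state: str = "incremental"


def reconcile(stored, configured_shards):
    """Return a status that is safe to use with the configured shard count."""
    if stored.num_shards == configured_shards:
        return stored
    # Markers recorded for the old layout do not map onto the new shards,
    # so discard them and start over with a full sync.
    return SyncStatus(num_shards=configured_shards, state="full")


stored = SyncStatus(num_shards=128, markers={0: "marker-a"})
print(reconcile(stored, 1).state)
```

Returning a fresh status instead of mutating the stored one keeps the old markers available for logging or rollback, which seems the safer choice for a sketch.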

Comment 4 Yehuda Sadeh 2016-05-06 22:22:02 UTC
Does that still happen with Casey's instructions here?

Comment 5 Ken Dreyer (Red Hat) 2016-05-10 13:02:48 UTC
From Yehuda's email today:
> I don't believe this one is critical for 2.0 (shouldn't happen in
> normal configuration, only when using obscure undocumented config).

Re-targeting to 2.1.

Comment 8 shilpa 2016-06-08 10:28:40 UTC
We need to ensure that the following parameters are not changed while the radosgw process is running.

rgw md log max shards
rgw data log num shards

These parameters should be set during the configuration of multisite and before restarting radosgw.
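
One way to catch an accidental change before restarting radosgw is to compare the two options in the current config file against a copy saved at initial setup. The sketch below uses Python's `configparser`, which handles simple ini-style files. It assumes unindented `key = value` lines, and the section name `client.rgw.gw1` and the values are made up for illustration. Option names are normalized because both the spaced and the underscored spellings appear in this report.

```python
import configparser

PINNED = ("rgw_md_log_max_shards", "rgw_data_log_num_shards")


def normalize(name):
    # Treat "rgw md log max shards" and "rgw_md_log_max_shards" as the same option.
    return name.strip().replace(" ", "_")


def pinned_options(conf_text):
    """Return {(section, option): value} for the pinned shard options."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            name = normalize(key)
            if name in PINNED:
                found[(section, name)] = int(value)
    return found


def changed(baseline, current):
    """Return {key: (old, new)} for every pinned option whose value differs."""
    keys = set(baseline) | set(current)
    return {k: (baseline.get(k), current.get(k))
            for k in keys if baseline.get(k) != current.get(k)}


before = pinned_options(
    "[client.rgw.gw1]\nrgw md log max shards = 64\nrgw_data_log_num_shards = 128\n")
after = pinned_options(
    "[client.rgw.gw1]\nrgw md log max shards = 1\nrgw data log num shards = 1\n")
for (section, name), (old, new) in sorted(changed(before, after).items()):
    print(f"{section}: {name} changed from {old} to {new}")
```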

Comment 14 Matt Benjamin (redhat) 2016-10-03 17:44:36 UTC
Dev will explore adding guards to prevent improper config changes of this type.

Comment 22 shilpa 2017-02-03 11:30:42 UTC
Verified on ceph-radosgw-10.2.5-13.el7cp.x86_64

Comment 27 errata-xmlrpc 2017-03-14 15:43:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html

