1330952 – radosgw process segfaults when multisite params are changed w/sync in progress

Summary: radosgw process segfaults when multisite params are changed w/sync in progress

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RGW
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	2.2
Assignee:	Casey Bodley
QA Contact:	shilpa
Docs Contact:	Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks:	1322504 1383917 1412948 1437916

Reported:	2016-04-27 11:08 UTC by shilpa
Modified:	2017-07-30 15:39 UTC
CC List:	11 users
Fixed In Version:	RHEL: ceph-10.2.5-7.el7cp Ubuntu: ceph_10.2.5-3redhat1xenial
Doc Type:	Known Issue
Doc Text:	.Multi-site configuration of the Ceph Object Gateway sometimes fails when options are changed at runtime
When the `rgw md log max shards` and `rgw data log num shards` options are changed at runtime in a multi-site configuration of the Ceph Object Gateway, the `radosgw` process terminates unexpectedly with a segmentation fault. To avoid this issue, do not change these options at runtime; set them during the initial configuration of the Ceph Object Gateway.
Clone Of:
Environment:
Last Closed:	2017-03-14 15:43:45 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	17414	None	None	None	2016-09-27 15:12:33 UTC
Ceph Project Bug Tracker	18488	None	None	None	2017-01-11 14:58:02 UTC
Red Hat Product Errata	RHBA-2017:0514	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.2 bug fix and enhancement update	2017-03-21 07:24:26 UTC

Description shilpa 2016-04-27 11:08:12 UTC

Description of problem:
On a 3-way active-active cluster, radosgw was found crashing continuously on two of the nodes.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.0-1.el7cp.x86_64

How reproducible:
Only once

Steps to Reproduce:
I am not sure what caused the segfault, but the data had synced on all the nodes. I enabled the following options in the config file and restarted the gateways:

 rgw md log max shards = 1
 rgw data log num shards = 1

After I restarted the gateway manually, the process crashed continuously until I disabled the above options. However, I have not been able to reproduce this by enabling the options again.
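To check which gateway sections of a config file carry these two options, something like the following sketch could be used. It assumes the file is INI-like enough for Python's `configparser`, and that option names written with spaces and with underscores refer to the same option (both spellings appear in this report). The section name `client.rgw.gateway-node1` is a made-up placeholder, not taken from the affected cluster.

```python
import configparser

# The two options discussed in this report, in underscore form.
SHARD_OPTIONS = ("rgw_md_log_max_shards", "rgw_data_log_num_shards")


def normalise(name):
    """Map 'rgw md log max shards' and 'rgw_md_log_max_shards' to one spelling."""
    return "_".join(name.lower().replace("_", " ").split())


def shard_settings(conf_text):
    """Return {section: {option: value}} for the shard options found in conf_text."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            if normalise(key) in SHARD_OPTIONS:
                found.setdefault(section, {})[normalise(key)] = int(value)
    return found


# Hypothetical config fragment; the section name is an assumption.
EXAMPLE_CONF = """
[client.rgw.gateway-node1]
rgw md log max shards = 1
rgw data log num shards = 1
"""

print(shard_settings(EXAMPLE_CONF))
```

Running the same check against the config of every gateway in the multisite setup would show whether any node overrides the shard counts.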


Actual results:
    0> 2016-04-27 10:35:10.652776 7f84727fc700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f84727fc700 thread_name:radosgw

 ceph version 10.2.0-1.el7cp (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (()+0x54659a) [0x7f8584caa59a]
 2: (()+0xf100) [0x7f85840e4100]
 3: (()+0x1c48d0) [0x7f857a03b8d0]
 4: (()+0x241c50) [0x7f857a0b8c50]
 5: (()+0x21a448) [0x7f857a091448]
 6: (()+0xc3649) [0x7f8579f3a649]
 7: (()+0xd04a6) [0x7f8579f474a6]
 8: (()+0xd3e72) [0x7f8579f4ae72]
 9: (()+0xd418d) [0x7f8579f4b18d]
 10: (()+0xa7700) [0x7f8579f1e700]
 11: (librados::IoCtx::operate(std::string const&, librados::ObjectReadOperation*, ceph::buffer::list*)+0x46) [0x7f8579ed8af6]
 12: (RGWRados::time_log_info(std::string const&, cls_log_header*)+0xd9) [0x7f8584b7b0a9]
 13: (RGWDataChangesLog::get_info(int, RGWDataChangesLogInfo*)+0x75) [0x7f8584a41ba5]
 14: (RGWOp_DATALog_ShardInfo::execute()+0x11b) [0x7f8584bd03cb]
 15: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd07) [0x7f8584b30637]
 16: (()+0x1937a) [0x7f858e5cc37a]
 17: (()+0x2330f) [0x7f858e5d630f]
 18: (()+0x252f8) [0x7f858e5d82f8]
 19: (()+0x7dc5) [0x7f85840dcdc5]
 20: (clone()+0x6d) [0x7f85836e728d]


I do see a lot of these messages occurring:

2016-04-27 10:57:02.551733 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.551751 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.552426 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.557895 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.558080 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22
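The return code in these messages can be decoded as a negated errno value, which is the usual convention in Ceph; under that assumption -22 is EINVAL ("Invalid argument"). A minimal sketch that tallies the codes in the log excerpt above and names them (the regular expression is written only for these two message shapes):

```python
import errno
import os
import re
from collections import Counter

# Log lines copied from the excerpt above.
LOG_LINES = [
    "2016-04-27 10:57:02.551733 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22",
    "2016-04-27 10:57:02.551751 7f93c2ffd700  0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22",
    "2016-04-27 10:57:02.552426 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22",
    "2016-04-27 10:57:02.557895 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22",
    "2016-04-27 10:57:02.558080 7f93c2ffd700  0 ERROR: failed to fetch remote data log info: ret=-22",
]

# Matches the trailing "returned -22" or "ret=-22" seen in these messages.
RET_PATTERN = re.compile(r"(?:returned |ret=)(-?\d+)\s*$")


def count_return_codes(lines):
    """Count how often each return code appears at the end of a log line."""
    counts = Counter()
    for line in lines:
        match = RET_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


for code, count in count_return_codes(LOG_LINES).items():
    # Assumption: the code is a negated errno value.
    name = errno.errorcode.get(-code, "unknown")
    print(f"{code} ({name}: {os.strerror(-code)}) seen {count} times")
```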


On all the nodes data is synced.

     data sync source: 9e11a3ed-376f-4f14-b493-cdb45fb88310 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: eb432a5e-92fb-480c-9147-f02df4f558e1 (us-3)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
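The status block above can also be checked mechanically, for example to confirm the shard count each source reports. The sketch below parses text in exactly the layout shown (presumably the output of `radosgw-admin sync status`); the regular expressions are assumptions based on this one excerpt, not on a documented output format.

```python
import re

# Status text copied from the excerpt above.
STATUS_TEXT = """\
     data sync source: 9e11a3ed-376f-4f14-b493-cdb45fb88310 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: eb432a5e-92fb-480c-9147-f02df4f558e1 (us-3)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
"""

SOURCE_RE = re.compile(r"source:\s+(\S+)\s+\((?P<zone>[^)]+)\)")
SHARDS_RE = re.compile(
    r"(?P<kind>full|incremental) sync:\s+(?P<done>\d+)/(?P<total>\d+) shards"
)


def parse_data_sync(text):
    """Return {zone: {'full': (n, total), 'incremental': (n, total), 'caught_up': bool}}."""
    sources = {}
    current = None
    for line in text.splitlines():
        match = SOURCE_RE.search(line)
        if match:
            # A new "source: <id> (<zone>)" line starts a new entry.
            current = sources.setdefault(match.group("zone"), {"caught_up": False})
            continue
        if current is None:
            continue
        match = SHARDS_RE.search(line)
        if match:
            current[match.group("kind")] = (
                int(match.group("done")),
                int(match.group("total")),
            )
        elif "data is caught up with source" in line:
            current["caught_up"] = True
    return sources


for zone, info in parse_data_sync(STATUS_TEXT).items():
    print(zone, info)
```

Both sources report 128 data log shards here, while the changed configuration asked for 1.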

Comment 3 Casey Bodley 2016-05-06 21:47:49 UTC

I imagine that this is caused by restarting the gateway with a different value of rgw_data_log_num_shards. The sync status is stored per shard, so the gateway should recognize this and restart a full sync. This hasn't been tested, though.

If this segfault becomes a blocker for you, make sure that these config variables are set during your initial cluster setup so they don't change at runtime.
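The idea of recognizing a changed shard count and restarting a full sync can be sketched as follows. This is not the radosgw implementation and not the fix that eventually shipped; `SyncStatus` and `reconcile` are hypothetical names, and the sketch only illustrates comparing a persisted per-shard status against the currently configured shard count.

```python
from dataclasses import dataclass, field


@dataclass
class SyncStatus:
    """Hypothetical persisted sync state: one position marker per log shard."""
    num_shards: int
    markers: dict = field(default_factory=dict)  # shard id -> position marker


def reconcile(stored, configured_shards):
    """Return (status, mode): resume incrementally, or restart a full sync."""
    if stored is not None and stored.num_shards == configured_shards:
        return stored, "incremental"
    # Markers recorded for a different shard layout no longer refer to
    # valid shards, so they are discarded rather than reused.
    return SyncStatus(num_shards=configured_shards), "full"


# Stored state from a 128-shard layout, gateway restarted with 1 shard.
stored = SyncStatus(num_shards=128, markers={i: "marker" for i in range(128)})
status, mode = reconcile(stored, configured_shards=1)
print(mode, status.num_shards, len(status.markers))
```

With an unchanged shard count the stored status is returned as-is, so the guard only costs a comparison on startup.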

Comment 4 Yehuda Sadeh 2016-05-06 22:22:02 UTC

Does that still happen with Casey's instructions here?

Comment 5 Ken Dreyer (Red Hat) 2016-05-10 13:02:48 UTC

From Yehuda's email today:
> I don't believe this one is critical for 2.0 (shouldn't happen in
> normal configuration, only when using obscure undocumented config).

Re-targeting to 2.1.

Comment 8 shilpa 2016-06-08 10:28:40 UTC

We need to ensure that the following parameters are not changed while the radosgw process is running:

rgw md log max shards
rgw data log num shards

These parameters should be set during the configuration of multisite and before restarting radosgw.

Comment 14 Matt Benjamin (redhat) 2016-10-03 17:44:36 UTC

Dev will explore adding guards to prevent improper config changes of this type.

Comment 22 shilpa 2017-02-03 11:30:42 UTC

Verified on ceph-radosgw-10.2.5-13.el7cp.x86_64

Comment 27 errata-xmlrpc 2017-03-14 15:43:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html
