| Summary: | radosgw process segfaults when multisite params are changed w/sync in progress | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | shilpa <smanjara> |
| Component: | RGW | Assignee: | Casey Bodley <cbodley> |
| Status: | CLOSED ERRATA | QA Contact: | shilpa <smanjara> |
| Severity: | urgent | Docs Contact: | Erin Donnelly <edonnell> |
| Priority: | unspecified | | |
| Version: | 2.0 | CC: | cbodley, ceph-eng-bugs, edonnell, hnallurv, kbader, kdreyer, mbenjamin, owasserm, smanjara, sweil, yehuda |
| Target Milestone: | rc | | |
| Target Release: | 2.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-10.2.5-7.el7cp; Ubuntu: ceph_10.2.5-3redhat1xenial | Doc Type: | Known Issue |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-14 15:43:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1322504, 1383917, 1412948, 1437916 | | |

Doc Text:

.Multi-site configuration of the Ceph Object Gateway sometimes fails when options are changed at runtime
When the `rgw md log max shards` and `rgw data log num shards` options are changed at runtime in a multi-site configuration of the Ceph Object Gateway, the `radosgw` process terminates unexpectedly with a segmentation fault. To avoid this issue, do not change these options at runtime; set them during the initial configuration of the Ceph Object Gateway instead.
I imagine that this is caused by restarting the gateway with a different value of rgw_data_log_num_shards. The sync status is stored per shard, so the gateway should recognize this and restart a full sync. This hasn't been tested, though. If this segfault becomes a blocker for you, make sure that these config variables are set during your initial cluster setup so they don't change at runtime.

Does that still happen with Casey's instructions here?

From Yehuda's email today:
> I don't believe this one is critical for 2.0 (shouldn't happen in
> normal configuration, only when using obscure undocumented config).
Re-targeting to 2.1.
Need to ensure that the following parameters are not changed during the radosgw process runtime:

- rgw md log max shards
- rgw data log num shards

These parameters should be set during the initial multi-site configuration, before radosgw is restarted (see the config sketch after this comment). Dev will explore adding guards to prevent improper config changes of this type.

Verified on ceph-radosgw-10.2.5-13.el7cp.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html
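The workaround above amounts to pinning the shard counts in `ceph.conf` before the gateways are started for the first time, so the values never change at runtime. A minimal sketch; the `[client.rgw.us-1]` section name is hypothetical (use your own gateway instance name), and the values shown are the ones used in this report:

```
# ceph.conf -- set these once during initial multi-site setup, before
# radosgw is first started; changing them on a running deployment can
# trigger the segfault described in this bug
[client.rgw.us-1]
rgw md log max shards = 1
rgw data log num shards = 1
```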
Description of problem:
On a 3-way active-active cluster, radosgw was found continuously crashing on two of the nodes.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.0-1.el7cp.x86_64

How reproducible:
Only once

Steps to Reproduce:
Not sure what caused the segfault, but the data had synced on all the nodes. I enabled the following options in the config file and restarted the gateways:

rgw md log max shards = 1
rgw data log num shards = 1

After restarting the gateway manually, the process kept crashing until I disabled the above options. However, I am not able to reproduce this again by enabling the options.

Actual results:

0> 2016-04-27 10:35:10.652776 7f84727fc700 -1 *** Caught signal (Segmentation fault) **
in thread 7f84727fc700 thread_name:radosgw

ceph version 10.2.0-1.el7cp (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
1: (()+0x54659a) [0x7f8584caa59a]
2: (()+0xf100) [0x7f85840e4100]
3: (()+0x1c48d0) [0x7f857a03b8d0]
4: (()+0x241c50) [0x7f857a0b8c50]
5: (()+0x21a448) [0x7f857a091448]
6: (()+0xc3649) [0x7f8579f3a649]
7: (()+0xd04a6) [0x7f8579f474a6]
8: (()+0xd3e72) [0x7f8579f4ae72]
9: (()+0xd418d) [0x7f8579f4b18d]
10: (()+0xa7700) [0x7f8579f1e700]
11: (librados::IoCtx::operate(std::string const&, librados::ObjectReadOperation*, ceph::buffer::list*)+0x46) [0x7f8579ed8af6]
12: (RGWRados::time_log_info(std::string const&, cls_log_header*)+0xd9) [0x7f8584b7b0a9]
13: (RGWDataChangesLog::get_info(int, RGWDataChangesLogInfo*)+0x75) [0x7f8584a41ba5]
14: (RGWOp_DATALog_ShardInfo::execute()+0x11b) [0x7f8584bd03cb]
15: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd07) [0x7f8584b30637]
16: (()+0x1937a) [0x7f858e5cc37a]
17: (()+0x2330f) [0x7f858e5d630f]
18: (()+0x252f8) [0x7f858e5d82f8]
19: (()+0x7dc5) [0x7f85840dcdc5]
20: (clone()+0x6d) [0x7f85836e728d]

I do see a lot of these messages occurring:

2016-04-27 10:57:02.551733 7f93c2ffd700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.551751 7f93c2ffd700 0 rgw meta sync: ERROR: RGWBackoffControlCR called coroutine returned -22
2016-04-27 10:57:02.552426 7f93c2ffd700 0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.557895 7f93c2ffd700 0 ERROR: failed to fetch remote data log info: ret=-22
2016-04-27 10:57:02.558080 7f93c2ffd700 0 ERROR: failed to fetch remote data log info: ret=-22

On all the nodes data is synced:

data sync source: 9e11a3ed-376f-4f14-b493-cdb45fb88310 (us-2)
                  syncing
                  full sync: 0/128 shards
                  incremental sync: 128/128 shards
                  data is caught up with source
          source: eb432a5e-92fb-480c-9147-f02df4f558e1 (us-3)
                  syncing
                  full sync: 0/128 shards
                  incremental sync: 128/128 shards
                  data is caught up with source
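For reference, the per-shard status shown above is the kind of output printed by `radosgw-admin sync status`; a typical invocation, run on a node in the zone being checked, would be:

```
# Reports metadata and data sync progress against each peer zone
# (us-2 and us-3 in this report)
radosgw-admin sync status
```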