Bug 1355641 - RGW Segfaults during I/O and sync operations on non-master node
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.0
Assigned To: Casey Bodley
QA Contact: shilpa
Depends On:
Blocks:
Reported: 2016-07-12 03:01 EDT by shilpa
Modified: 2017-07-31 10:15 EDT
CC: 10 users

See Also:
Fixed In Version: RHEL: ceph-10.2.2-23.el7cp Ubuntu: ceph_10.2.2-18redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 15:44:04 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers:
Tracker ID                             Priority  Status        Summary                                                  Last Updated
Ceph Project Bug Tracker 16666         None      None          None                                                     2016-07-12 15:49 EDT
Red Hat Product Errata RHBA-2016:1755  normal    SHIPPED_LIVE  Red Hat Ceph Storage 2.0 bug fix and enhancement update  2016-08-23 19:23:52 EDT

Description shilpa 2016-07-12 03:01:24 EDT
Description of problem:
Uploaded objects from both the master and non-master zones. A segfault occurred on the non-master node before any data had synced to the non-master zone.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-18.el7cp.x86_64


Steps to Reproduce:
1. Upload files to the same buckets from both rgw nodes
2. Monitor the sync on both nodes
3. Set a delay of 150 ms and a packet loss of 5% (see the netem sketch below)
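
One way to inject the delay and loss from step 3 is tc netem on the RGW node's network interface. This is only a sketch; the interface name eth0 is an assumption and should be adjusted to the actual NIC:

 # add 150 ms delay and 5% packet loss on outgoing traffic
 tc qdisc add dev eth0 root netem delay 150ms loss 5%
 # remove the impairment after the test
 tc qdisc del dev eth0 root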

Actual results:

2016-07-12 06:33:31.803740 7fadf5feb700  0 ERROR: full sync on bucket26 bucket_id=f5717851-2682-475a-b24b-7bcdec728cbe.14116.27 shard_id=-1 failed, retcode=-16
2016-07-12 06:33:31.803809 7fadf5feb700  0 ERROR: lease cr failed, done early
2016-07-12 06:33:31.803817 7fadf5feb700  0 ERROR: full sync on bucket33 bucket_id=f5717851-2682-475a-b24b-7bcdec728cbe.14116.34 shard_id=-1 failed, retcode=-16
2016-07-12 06:33:31.803823 7fadf5feb700  0 ERROR: lease cr failed, done early
2016-07-12 06:33:31.803888 7fadf5feb700  0 ERROR: incremental sync on bucket62 bucket_id=f5717851-2682-475a-b24b-7bcdec728cbe.14116.63 shard_id=-1 failed, retcode=-16


2016-07-12 06:33:31.966492 7facae7f4700 -1 *** Caught signal (Segmentation fault) **
 in thread 7facae7f4700 thread_name:radosgw

 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6)
 1: (()+0x54e2ba) [0x7fae83a6c2ba]
 2: (()+0xf100) [0x7fae82e9d100]
 3: (RGWCoroutinesStack::wakeup()+0xe) [0x7fae837e84ce]
 4: (RGWRemoteMetaLog::wakeup(int)+0x92) [0x7fae838762c2]
 5: (RGWRados::wakeup_meta_sync_shards(std::set<int, std::less<int>, std::allocator<int> >&)+0x4b) [0x7fae838fbbcb]
 6: (RGWOp_MDLog_Notify::execute()+0x3f4) [0x7fae839989e4]
 7: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd07) [0x7fae838eeb77]
 8: (()+0x19373) [0x7fae8d390373]
 9: (()+0x232ef) [0x7fae8d39a2ef]
 10: (()+0x252d8) [0x7fae8d39c2d8]
 11: (()+0x7dc5) [0x7fae82e95dc5]
 12: (clone()+0x6d) [0x7fae8249fced]
Comment 4 Casey Bodley 2016-07-14 13:35:46 EDT
So far unable to reproduce this one. The log doesn't have debug information other than the ~30 seconds leading up to the segfault, so it's hard to see what's happening with the long-running RGWMetaSyncShardControlCR that's being woken up here.

I do see a potential issue that could lead to this stack trace. RGWMetaSyncCR holds a reference to its RGWMetaSyncShardControlCRs to guarantee that the coroutines won't be freed before it tries to call wakeup() on them. However, we don't hold references to the RGWCoroutinesStacks associated with the RGWMetaSyncShardControlCRs. So if a coroutine were to finish early, its stack would be freed and a later call to RGWCoroutinesStack::wakeup() would segfault.

I'll prepare and test a patch that holds a reference to the stack instead of the coroutine itself.

In the meantime, Shilpa, if you're able to reproduce this with --debug-rgw=20 and --debug-ms=1, I'd love to see the logs.
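
To illustrate the lifetime issue described above, here is a minimal, simplified sketch (stand-in types only, not the actual RGWCoroutinesStack/RGWMetaSyncShardControlCR code): holding a reference to the coroutine does not pin the stack it runs on, so a late wakeup() touches freed memory, while keeping a reference to the stack itself avoids that.

 // Simplified illustration of the use-after-free pattern; not Ceph code.
 #include <iostream>
 #include <memory>

 struct Stack {                        // stand-in for RGWCoroutinesStack
   void wakeup() { std::cout << "stack woken up\n"; }
 };

 struct Coroutine {                    // stand-in for RGWMetaSyncShardControlCR
   Stack* stack = nullptr;             // raw pointer: no ownership of the stack
   void wakeup() { stack->wakeup(); }  // use-after-free if the stack was freed
 };

 int main() {
   auto cr = std::make_shared<Coroutine>();   // parent holds the coroutine...
   {
     auto stack = std::make_unique<Stack>();  // ...but not its stack
     cr->stack = stack.get();
   }                                          // stack freed here (coroutine finished early)
   // cr->wakeup();                           // would crash, as in the backtrace above

   // Fixed pattern (analogous to the proposed patch): hold a reference to the
   // stack itself so it cannot be freed while a wakeup() may still arrive.
   auto stack = std::make_shared<Stack>();
   cr->stack = stack.get();
   cr->wakeup();                              // safe: the stack is still owned here
   return 0;
 }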
Comment 5 Ken Dreyer (Red Hat) 2016-07-14 15:48:07 EDT
PR undergoing review (assigned to Yehuda) https://github.com/ceph/ceph/pull/10301
Comment 6 Casey Bodley 2016-07-18 13:17:37 EDT
The fix has been cherry-picked to ceph-2-rhel-patches.
Comment 10 shilpa 2016-07-26 02:20:24 EDT
Verified in 10.2.2-23
Comment 12 errata-xmlrpc 2016-08-23 15:44:04 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html
