Bug 1476888 - rgw: segfault in RGWMetaSyncShardCR::incremental_sync completion
Summary: rgw: segfault in RGWMetaSyncShardCR::incremental_sync completion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.3
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: rc
Target Release: 2.4
Assignee: Matt Benjamin (redhat)
QA Contact: Warren
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1473436 1479701
 
Reported: 2017-07-31 17:51 UTC by Matt Benjamin (redhat)
Modified: 2020-12-14 09:17 UTC
CC: 18 users

Fixed In Version: RHEL: ceph-10.2.7-33.el7cp Ubuntu: ceph_10.2.7-34redhat1
Doc Type: Bug Fix
Doc Text:
.Multi-site synchronization works as expected
Due to an object lifetime defect in the Ceph Object Gateway multi-site synchronization code path, a failure could occur during incremental sync. The underlying source code has been modified, and multi-site synchronization now works as expected.
Clone Of:
Environment:
Last Closed: 2017-10-17 18:12:51 UTC
Embargoed:


Attachments:
bucketbrigade.py (585 bytes, text/plain) — 2017-09-27 17:14 UTC, Warren
bucketbrigade.py (585 bytes, text/plain) — 2017-09-27 17:15 UTC, Warren
rstart.sh (108 bytes, text/plain) — 2017-09-27 17:17 UTC, Warren


Links:
Ceph Project Bug Tracker 20251 — 2017-07-31 17:51:07 UTC
Red Hat Product Errata RHBA-2017:2903 (SHIPPED_LIVE): Red Hat Ceph Storage 2.4 enhancement and bug fix update — 2017-10-17 22:12:30 UTC

Description Matt Benjamin (redhat) 2017-07-31 17:51:08 UTC
Description of problem:

--- Comment #13 from Casey Bodley <cbodley> ---
(In reply to Harald Klein from comment #3)
>      0> 2017-07-27 16:50:59.945399 7fe5fffcf700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7fe5fffcf700 thread_name:radosgw
>
>  ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
>  1: (()+0x58e79a) [0x7fe8325b979a]
>  2: (()+0xf370) [0x7fe8319aa370]
>  3: (Mutex::Lock(bool)+0x4) [0x7fe832735b44]
>  4: (RGWCompletionManager::wakeup(void*)+0x18) [0x7fe832305418]
>  5: (RGWMetaSyncShardCR::incremental_sync()+0xda1) [0x7fe8323b5b41]
>  6: (RGWMetaSyncShardCR::operate()+0x44) [0x7fe8323b7714]
>  7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fe83230497e]
>  8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*,
> std::allocator<RGWCoroutinesStack*> >&)+0x3f8) [0x7fe832307468]
>  9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fe832307fe0]
>  10: (RGWRemoteMetaLog::run_sync()+0xfc2) [0x7fe8323a7202]
>  11: (RGWMetaSyncProcessorThread::process()+0xd) [0x7fe83248d7cd]
>  12: (RGWRadosThread::Worker::entry()+0x133) [0x7fe83242d043]
>  13: (()+0x7dc5) [0x7fe8319a2dc5]
>  14: (clone()+0x6d) [0x7fe830faf73d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> -------
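
The trace above, together with the Doc Text's "object lifetime defect", points to a wakeup callback outliving the object it wakes: `RGWCompletionManager::wakeup()` takes a lock on an object that may already be gone when `incremental_sync()` resumes. A minimal, self-contained C++ sketch of that general lifetime pattern and one common remedy (weak ownership); the class and function names here are illustrative stand-ins, not the real rgw types:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a completion manager; only the
// lifetime pattern is modeled, not the real RGWCompletionManager.
class CompletionManager {
 public:
  void wakeup(void* handle) { woken_.push_back(handle); }
  std::size_t wakeups() const { return woken_.size(); }
 private:
  std::vector<void*> woken_;
};

// Buggy pattern: the callback captures a raw pointer, so if the
// manager is destroyed before the callback fires, wakeup() runs
// on freed memory (a use-after-free, as in the backtrace above).
std::function<void()> make_unsafe_cb(CompletionManager* mgr, void* h) {
  return [mgr, h] { mgr->wakeup(h); };  // dangles once *mgr is gone
}

// Safer pattern: capture weak ownership and silently skip the
// wakeup if the manager has already been destroyed.
std::function<void()> make_safe_cb(std::weak_ptr<CompletionManager> mgr,
                                   void* h) {
  return [mgr, h] {
    if (auto m = mgr.lock()) m->wakeup(h);  // no-op after destruction
  };
}
```

Usage: build the callback from a `std::shared_ptr`, and it stays valid to invoke even after the last owner resets the pointer, whereas the raw-pointer version would crash in exactly the way the backtrace shows.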

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 shilpa 2017-09-11 08:02:15 UTC
Hi Casey,

Could you please provide steps to reproduce this BZ?

Comment 23 Warren 2017-09-27 17:14:43 UTC
Created attachment 1331489 [details]
bucketbrigade.py

Used to reproduce this problem (run on the master).

Comment 24 Warren 2017-09-27 17:15:57 UTC
Created attachment 1331490 [details]
bucketbrigade.py

Used to reproduce the problem.  Run on master.

Comment 25 Warren 2017-09-27 17:17:01 UTC
Created attachment 1331491 [details]
rstart.sh

Used to reproduce the bug.  Run on secondary.

Comment 26 Warren 2017-09-27 17:18:49 UTC
Successfully reproduced:

Running bucketbrigade.py on the master and rstart.sh on the secondary, this problem occurred 3 times in about 14 hours.

Comment 27 Warren 2017-09-27 17:27:10 UTC
I reproduced this on magna009 (in case anyone wants to look at the Segmentation faults in the rgw log).

Comment 31 Warren 2017-09-28 01:31:39 UTC
This test has been running with the patch for over 5 hours without reporting the problem.  I will leave it to run overnight, and will be in before 9 AM PST.  If this test shows no more problems at that time, then I will mark it as Verified.

Comment 36 Tamil 2017-09-30 03:47:35 UTC
The fix has been running on the test bed for 17 hours now with NO sign of segmentation fault.
Marking it as "verified".

Comment 39 Warren 2017-10-04 22:04:15 UTC
Running this test on the 2.4A async build failed.  Talking to tserlin, it appears that this change is in both sets of patches:

On 2.4A:

https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=0c28f6912f03f2def4532c9c6a4c958f714bd206

On Hotfix:

https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=d1aad1b7c92e7305fe3e1a8cd6496c7d1df124a2

Comment 42 Warren 2017-10-06 18:16:03 UTC
The crash appears once and is different from the bug that was fixed. I will report that crash as a separate bug. I am marking this as verified for 2.4 Async.

Comment 48 errata-xmlrpc 2017-10-17 18:12:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2903

