Bug 1476888 - rgw: segfault in RGWMetaSyncShardCR::incremental_sync completion
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.3
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: rc
Target Release: 2.4
Assigned To: Matt Benjamin (redhat)
QA Contact: Warren
Docs Contact: Bara Ancincova
Depends On:
Blocks: 1473436 1479701
Reported: 2017-07-31 13:51 EDT by Matt Benjamin (redhat)
Modified: 2017-10-18 14:13 EDT
CC: 18 users

See Also:
Fixed In Version: RHEL: ceph-10.2.7-33.el7cp Ubuntu: ceph_10.2.7-34redhat1
Doc Type: Bug Fix
Doc Text:
.The multi-site synchronization works as expected
Due to an object lifetime defect in the Ceph Object Gateway multi-site synchronization code path, a failure could occur during incremental sync. The underlying source code has been modified, and the multi-site synchronization works as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-17 14:12:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
bucketbrigade.py (585 bytes, text/plain), 2017-09-27 13:14 EDT, Warren
bucketbrigade.py (585 bytes, text/plain), 2017-09-27 13:15 EDT, Warren
rstart.sh (108 bytes, text/plain), 2017-09-27 13:17 EDT, Warren


External Trackers
Ceph Project Bug Tracker 20251 (last updated 2017-07-31 13:51 EDT)

Description Matt Benjamin (redhat) 2017-07-31 13:51:08 EDT
Description of problem:

--- Comment #13 from Casey Bodley <cbodley@redhat.com> ---
(In reply to Harald Klein from comment #3)
>      0> 2017-07-27 16:50:59.945399 7fe5fffcf700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7fe5fffcf700 thread_name:radosgw
>
>  ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
>  1: (()+0x58e79a) [0x7fe8325b979a]
>  2: (()+0xf370) [0x7fe8319aa370]
>  3: (Mutex::Lock(bool)+0x4) [0x7fe832735b44]
>  4: (RGWCompletionManager::wakeup(void*)+0x18) [0x7fe832305418]
>  5: (RGWMetaSyncShardCR::incremental_sync()+0xda1) [0x7fe8323b5b41]
>  6: (RGWMetaSyncShardCR::operate()+0x44) [0x7fe8323b7714]
>  7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fe83230497e]
>  8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f8) [0x7fe832307468]
>  9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fe832307fe0]
>  10: (RGWRemoteMetaLog::run_sync()+0xfc2) [0x7fe8323a7202]
>  11: (RGWMetaSyncProcessorThread::process()+0xd) [0x7fe83248d7cd]
>  12: (RGWRadosThread::Worker::entry()+0x133) [0x7fe83242d043]
>  13: (()+0x7dc5) [0x7fe8319a2dc5]
>  14: (clone()+0x6d) [0x7fe830faf73d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> -------
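
The trace ends in RGWCompletionManager::wakeup() taking a Mutex, which fits the object lifetime defect described in the Doc Text: the completion target appears to have been torn down before the wakeup ran. The following stand-alone C++ sketch (not Ceph code; apart from the wakeup() name, the classes and members are illustrative placeholders) shows the kind of use-after-free hazard involved:

#include <iostream>
#include <memory>
#include <mutex>
#include <condition_variable>

// Stand-in for the completion manager: wakeup() locks a member mutex,
// which is exactly the step that faults if the object has been freed.
class CompletionManager {
  std::mutex lock;
  std::condition_variable cond;
public:
  void wakeup() {
    std::lock_guard<std::mutex> l(lock);  // crashes here on a dangling object
    cond.notify_all();
  }
};

// Stand-in for a sync coroutine that keeps only a raw pointer to the manager,
// so nothing guarantees the manager outlives the completion.
struct ShardCR {
  CompletionManager *mgr;
  void on_incremental_sync_complete() {
    mgr->wakeup();  // use-after-free if the manager was already destroyed
  }
};

int main() {
  auto mgr = std::make_unique<CompletionManager>();
  ShardCR cr{mgr.get()};

  mgr.reset();  // manager destroyed while the coroutine still holds the pointer

  // cr.on_incremental_sync_complete();  // undefined behaviour: the crash scenario
  std::cout << "lifetime hazard illustrated (the faulting call is commented out)\n";
  return 0;
}

A typical remedy for this pattern is to give the completion path shared ownership of its target or to drain pending wakeups before teardown; the sketch only illustrates the failure mode, not the actual fix.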

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 4 shilpa 2017-09-11 04:02:15 EDT
Hi Casey,

Could you please provide steps to reproduce this BZ?
Comment 23 Warren 2017-09-27 13:14 EDT
Created attachment 1331489 [details]
bucketbrigade.py

Used to reproduce this problem (run on the master).
Comment 24 Warren 2017-09-27 13:15 EDT
Created attachment 1331490 [details]
bucketbrigade.py

Used to reproduce the problem.  Run on master.
Comment 25 Warren 2017-09-27 13:17 EDT
Created attachment 1331491 [details]
rstart.sh

Used to reproduce the bug.  Run on secondary.
Comment 26 Warren 2017-09-27 13:18:49 EDT
Successfully reproduced:

Running bucketbrigade.py on the master and rstart.sh on the secondary, this problem occurred 3 times in about 14 hours.
Comment 27 Warren 2017-09-27 13:27:10 EDT
I reproduced this on magna009 (in case anyone wants to look at the Segmentation faults in the rgw log).
Comment 31 Warren 2017-09-27 21:31:39 EDT
This test has been running with the patch for over 5 hours without reporting the problem. I will leave it running overnight and will be in before 9 AM PST. If the test shows no more problems at that time, I will mark it as Verified.
Comment 36 Tamil 2017-09-29 23:47:35 EDT
The fix has been running on the test bed for 17 hours now with NO sign of segmentation fault.
Marking it as "verified".
Comment 39 Warren 2017-10-04 18:04:15 EDT
Running this test on the 2.4A async build failed. After talking to tserlin, it appears that this change is in both sets of patches:

On 2.4A:

https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=0c28f6912f03f2def4532c9c6a4c958f714bd206

On the hotfix:

https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=d1aad1b7c92e7305fe3e1a8cd6496c7d1df124a2
Comment 42 Warren 2017-10-06 14:16:03 EDT
The crash appeared once and is a different one from the bug that was fixed. I will report that crash as another bug. I am marking this as verified for 2.4 Async.
Comment 48 errata-xmlrpc 2017-10-17 14:12:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2903
