Description of problem:

--- Comment #13 from Casey Bodley <cbodley> ---

(In reply to Harald Klein from comment #3)
>  0> 2017-07-27 16:50:59.945399 7fe5fffcf700 -1 *** Caught signal (Segmentation fault) **
>  in thread 7fe5fffcf700 thread_name:radosgw
>
>  ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
>  1: (()+0x58e79a) [0x7fe8325b979a]
>  2: (()+0xf370) [0x7fe8319aa370]
>  3: (Mutex::Lock(bool)+0x4) [0x7fe832735b44]
>  4: (RGWCompletionManager::wakeup(void*)+0x18) [0x7fe832305418]
>  5: (RGWMetaSyncShardCR::incremental_sync()+0xda1) [0x7fe8323b5b41]
>  6: (RGWMetaSyncShardCR::operate()+0x44) [0x7fe8323b7714]
>  7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fe83230497e]
>  8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f8) [0x7fe832307468]
>  9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fe832307fe0]
>  10: (RGWRemoteMetaLog::run_sync()+0xfc2) [0x7fe8323a7202]
>  11: (RGWMetaSyncProcessorThread::process()+0xd) [0x7fe83248d7cd]
>  12: (RGWRadosThread::Worker::entry()+0x133) [0x7fe83242d043]
>  13: (()+0x7dc5) [0x7fe8319a2dc5]
>  14: (clone()+0x6d) [0x7fe830faf73d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable): ceph 10.2.7-28.el7cp
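The backtrace shows the meta sync coroutine segfaulting inside Mutex::Lock via RGWCompletionManager::wakeup(void*), which is the signature of a wakeup delivered against a stale or unregistered context pointer. The following is a minimal sketch of the guarded-wakeup pattern that avoids this class of crash; it is illustrative only (CompletionManager, register_ctx, and unregister_ctx are hypothetical names, not Ceph's actual code or the actual fix):

```cpp
#include <mutex>
#include <set>

// Hypothetical sketch: a completion manager that only wakes contexts it
// still knows about. A raw void* handed to wakeup() may refer to a context
// that has already been unregistered (and possibly freed), so the manager
// checks its own registry under its own lock before touching anything.
class CompletionManager {
  std::mutex lock;
  std::set<void*> registered;  // contexts currently allowed to be woken
  int wakeups = 0;             // how many wakeups were actually delivered

public:
  void register_ctx(void* ctx) {
    std::lock_guard<std::mutex> l(lock);
    registered.insert(ctx);
  }

  void unregister_ctx(void* ctx) {
    std::lock_guard<std::mutex> l(lock);
    registered.erase(ctx);
  }

  // Guarded wakeup: a stale pointer becomes a harmless no-op instead of a
  // dereference of freed memory. Returns whether the wakeup was delivered.
  bool wakeup(void* ctx) {
    std::lock_guard<std::mutex> l(lock);
    if (registered.count(ctx) == 0) {
      return false;  // context already gone: ignore
    }
    ++wakeups;
    return true;
  }

  int delivered() {
    std::lock_guard<std::mutex> l(lock);
    return wakeups;
  }
};
```

The key design point is that the registry and the wakeup share one lock, so unregistration and wakeup cannot race: once unregister_ctx() returns, no later wakeup() can touch that context.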
Hi Casey, Could you please provide steps to reproduce this BZ?
Created attachment 1331489 [details]
bucketbrigade.py

Used to reproduce this problem (run on master).
Created attachment 1331490 [details]
bucketbrigade.py

Used to reproduce the problem. Run on master.
Created attachment 1331491 [details]
rstart.sh

Used to reproduce the bug. Run on secondary.
Successfully reproduced: running bucketbrigade.py on the master and rstart.sh on the secondary, the problem occurred 3 times in about 14 hours.
I reproduced this on magna009 (in case anyone wants to look at the Segmentation faults in the rgw log).
This test has been running with the patch for over 5 hours without reporting the problem. I will leave it to run overnight, and will be in before 9 AM PST. If this test shows no more problems at that time, then I will mark it as Verified.
The fix has now been running on the test bed for 17 hours with no sign of the segmentation fault. Marking this as Verified.
Running this test on the 2.4 Async build failed. Talking to tserlin, it appears that this change is in both sets of patches:

On 2.4 Async:
https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=0c28f6912f03f2def4532c9c6a4c958f714bd206

On Hotfix:
https://code.engineering.redhat.com/gerrit/gitweb?p=ceph.git;a=commit;h=d1aad1b7c92e7305fe3e1a8cd6496c7d1df124a2
The crash appeared once and is a different one from the bug that was fixed; I will report it as a separate bug. Marking this as Verified for 2.4 Async.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2903