Bug 1393665

Summary: Multisite error handling leads to segfaults
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RGW
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.1
CC: cbodley, ceph-eng-bugs, hnallurv, kbader, kdreyer, mbenjamin, owasserm, sweil, tserlin
Target Milestone: rc
Target Release: 2.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.3-13.el7cp Ubuntu: ceph_10.2.3-14redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-22 19:33:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description shilpa 2016-11-10 06:19:21 UTC
Description of problem:
While running the s3-tests workload, the radosgw process crashes with the following backtrace:

in thread 7f6bf6ffd700 thread_name:radosgw

 ceph version 10.2.3-12.el7cp (120ddb2dc963bbd3fe12b13c19f7a69422e2d039)
 1: (()+0x5709ca) [0x7f6da3b929ca]
 2: (()+0xf100) [0x7f6da2fa1100]
 3: (gsignal()+0x37) [0x7f6da24e25f7]
 4: (abort()+0x148) [0x7f6da24e3ce8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f6da3d84e47]
 6: (Mutex::Lock(bool)+0x19c) [0x7f6da3d0e8dc]
 7: (RGWRemoteDataLog::wakeup(int, std::set<std::string, std::less<std::string>, std::allocator<std::string> >&)+0x9f) [0x7f6da39a344f]
 8: (RGWRados::wakeup_data_sync_shards(std::string const&, std::map<int, std::set<std::string, std::less<std::string>, std::allocator<std::string> >, std::less<int>, std::allocator<std::pair<int const, std::set<std::string, std::less<std::string>, std::allocator<std::string> > > > >&)+0x28f) [0x7f6da3a0cabf]
 9: (RGWOp_DATALog_Notify::execute()+0x495) [0x7f6da3ab5ff5]
 10: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd7f) [0x7f6da39ff7df]
 11: (()+0x192b3) [0x7f6dad4b92b3]
 12: (()+0x2327f) [0x7f6dad4c327f]
 13: (()+0x25298) [0x7f6dad4c5298]
 14: (()+0x7dc5) [0x7f6da2f99dc5]
 15: (clone()+0x6d) [0x7f6da25a3ced]

The related fixes are upstream:

http://tracker.ceph.com/issues/17569 
http://tracker.ceph.com/issues/17570
http://tracker.ceph.com/issues/17571

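For context, the assert in Mutex::Lock() fires inside RGWRemoteDataLog::wakeup() while handling a datalog notify request (RGWOp_DATALog_Notify::execute -> RGWRados::wakeup_data_sync_shards), which suggests the notify races against setup or teardown of the per-source sync handler, so its lock is not in a usable state when the request arrives. The following is only a minimal C++ sketch of that failure pattern and of a defensive lookup that drops notifications for unknown or uninitialized source zones; the class and function names here (RemoteDataLog, SyncManager, etc.) are illustrative stand-ins, not the actual Ceph implementation, and the real fixes are in the upstream trackers above.

// Illustrative sketch only; not Ceph source. It models the failure mode
// suggested by the backtrace: a notify request path calling wakeup() on a
// per-source data-sync handler that was never initialized for that zone.

#include <map>
#include <memory>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the per-source data sync handler.
class RemoteDataLog {
public:
  // Wake up the sync machinery watching the given shard. Safe only if the
  // handler was fully constructed before requests can reach it.
  void wakeup(int shard_id, const std::set<std::string>& keys) {
    std::lock_guard<std::mutex> l(lock_);   // analogous to the Mutex::Lock frame above
    pending_[shard_id].insert(keys.begin(), keys.end());
  }

private:
  std::mutex lock_;
  std::map<int, std::set<std::string>> pending_;
};

// Hypothetical stand-in for the store-level dispatch.
class SyncManager {
public:
  // Unsafe pattern: hands back a raw handler pointer that may be null when
  // sync was never started for 'source_zone'; the caller then dereferences
  // it and locks garbage, which ends in an assert or segfault.
  RemoteDataLog* get_unchecked(const std::string& source_zone) {
    auto it = handlers_.find(source_zone);
    return (it == handlers_.end()) ? nullptr : it->second.get();
  }

  // Defensive pattern: look up and call under the map lock, and simply
  // drop the notification if no handler exists for that source zone.
  void wakeup_data_sync_shards(const std::string& source_zone,
                               std::map<int, std::set<std::string>>& shards) {
    std::lock_guard<std::mutex> l(handlers_lock_);
    auto it = handlers_.find(source_zone);
    if (it == handlers_.end()) {
      return;  // unknown/uninitialized source: ignore instead of crashing
    }
    for (auto& entry : shards) {
      it->second->wakeup(entry.first, entry.second);
    }
  }

  void add_source(const std::string& source_zone) {
    std::lock_guard<std::mutex> l(handlers_lock_);
    handlers_[source_zone] = std::make_unique<RemoteDataLog>();
  }

private:
  std::mutex handlers_lock_;
  std::map<std::string, std::unique_ptr<RemoteDataLog>> handlers_;
};

int main() {
  SyncManager mgr;
  std::map<int, std::set<std::string>> shards = {{0, {"bucket1"}}};

  // A datalog notify for a zone that was never initialized: the checked
  // path returns quietly; the unchecked path would dereference nullptr.
  mgr.wakeup_data_sync_shards("unknown-zone", shards);

  mgr.add_source("zone-b");
  mgr.wakeup_data_sync_shards("zone-b", shards);
  return 0;
}

Dropping notifications for sources without a live handler is only one plausible way to harden the notify path; the actual upstream changes may take a different approach.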

Version-Release number of selected component (if applicable):
10.2.3-12

How reproducible:
One out of three times

Steps to Reproduce:
1. Create a multisite configuration with two zones.
2. Run the s3-tests workload.
3. At some point the race condition is hit and radosgw crashes.

Comment 8 shilpa 2016-11-14 14:35:51 UTC
Tested and verified on ceph-10.2.3-13

Comment 10 errata-xmlrpc 2016-11-22 19:33:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html