Bug 1393665

Summary: Multisite error handling leads to segfaults
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RGW
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.1
CC: cbodley, ceph-eng-bugs, hnallurv, kbader, kdreyer, mbenjamin, owasserm, sweil, tserlin
Target Milestone: rc
Target Release: 2.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.3-13.el7cp Ubuntu: ceph_10.2.3-14redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-22 19:33:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description shilpa 2016-11-10 06:19:21 UTC
Description of problem:
While running the s3-tests workload, the radosgw process crashes with the following backtrace:

in thread 7f6bf6ffd700 thread_name:radosgw

 ceph version 10.2.3-12.el7cp (120ddb2dc963bbd3fe12b13c19f7a69422e2d039)
 1: (()+0x5709ca) [0x7f6da3b929ca]
 2: (()+0xf100) [0x7f6da2fa1100]
 3: (gsignal()+0x37) [0x7f6da24e25f7]
 4: (abort()+0x148) [0x7f6da24e3ce8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f6da3d84e47]
 6: (Mutex::Lock(bool)+0x19c) [0x7f6da3d0e8dc]
 7: (RGWRemoteDataLog::wakeup(int, std::set<std::string, std::less<std::string>, std::allocator<std::string> >&)+0x9f) [0x7f6da39a344f]
 8: (RGWRados::wakeup_data_sync_shards(std::string const&, std::map<int, std::set<std::string, std::less<std::string>, std::allocator<std::string> >, std::less<int>, std::allocator<std::pair<int const, std::set<std::string, std::less<std::string>, std::allocator<std::string> > > > >&)+0x28f) [0x7f6da3a0cabf]
 9: (RGWOp_DATALog_Notify::execute()+0x495) [0x7f6da3ab5ff5]
 10: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xd7f) [0x7f6da39ff7df]
 11: (()+0x192b3) [0x7f6dad4b92b3]
 12: (()+0x2327f) [0x7f6dad4c327f]
 13: (()+0x25298) [0x7f6dad4c5298]
 14: (()+0x7dc5) [0x7f6da2f99dc5]
 15: (clone()+0x6d) [0x7f6da25a3ced]

The related fixes are upstream:

http://tracker.ceph.com/issues/17569 
http://tracker.ceph.com/issues/17570
http://tracker.ceph.com/issues/17571

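For context, the assert in Mutex::Lock() fires inside RGWRemoteDataLog::wakeup() while handling a datalog notify request (RGWOp_DATALog_Notify::execute -> RGWRados::wakeup_data_sync_shards), which suggests the notify races against setup or teardown of the per-source sync handler, so its lock is not in a usable state when the request arrives. The following is only a minimal C++ sketch of that failure pattern and of a defensive lookup that drops notifications for unknown or uninitialized source zones; the class and function names here (RemoteDataLog, SyncManager, etc.) are illustrative stand-ins, not the actual Ceph implementation, and the real fixes are in the upstream trackers above.

// Illustrative sketch only; not Ceph source. It models the failure mode
// suggested by the backtrace: a notify request path calling wakeup() on a
// per-source data-sync handler that was never initialized for that zone.

#include <map>
#include <memory>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the per-source data sync handler.
class RemoteDataLog {
public:
  // Wake up the sync machinery watching the given shard. Safe only if the
  // handler was fully constructed before requests can reach it.
  void wakeup(int shard_id, const std::set<std::string>& keys) {
    std::lock_guard<std::mutex> l(lock_);   // analogous to the Mutex::Lock frame above
    pending_[shard_id].insert(keys.begin(), keys.end());
  }

private:
  std::mutex lock_;
  std::map<int, std::set<std::string>> pending_;
};

// Hypothetical stand-in for the store-level dispatch.
class SyncManager {
public:
  // Unsafe pattern: hands back a raw handler pointer that may be null when
  // sync was never started for 'source_zone'; the caller then dereferences
  // it and locks garbage, which ends in an assert or segfault.
  RemoteDataLog* get_unchecked(const std::string& source_zone) {
    auto it = handlers_.find(source_zone);
    return (it == handlers_.end()) ? nullptr : it->second.get();
  }

  // Defensive pattern: look up and call under the map lock, and simply
  // drop the notification if no handler exists for that source zone.
  void wakeup_data_sync_shards(const std::string& source_zone,
                               std::map<int, std::set<std::string>>& shards) {
    std::lock_guard<std::mutex> l(handlers_lock_);
    auto it = handlers_.find(source_zone);
    if (it == handlers_.end()) {
      return;  // unknown/uninitialized source: ignore instead of crashing
    }
    for (auto& entry : shards) {
      it->second->wakeup(entry.first, entry.second);
    }
  }

  void add_source(const std::string& source_zone) {
    std::lock_guard<std::mutex> l(handlers_lock_);
    handlers_[source_zone] = std::make_unique<RemoteDataLog>();
  }

private:
  std::mutex handlers_lock_;
  std::map<std::string, std::unique_ptr<RemoteDataLog>> handlers_;
};

int main() {
  SyncManager mgr;
  std::map<int, std::set<std::string>> shards = {{0, {"bucket1"}}};

  // A datalog notify for a zone that was never initialized: the checked
  // path returns quietly; the unchecked path would dereference nullptr.
  mgr.wakeup_data_sync_shards("unknown-zone", shards);

  mgr.add_source("zone-b");
  mgr.wakeup_data_sync_shards("zone-b", shards);
  return 0;
}

Dropping notifications for sources without a live handler is only one plausible way to harden the notify path; the actual upstream changes may take a different approach.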

Version-Release number of selected component (if applicable):
10.2.3-12

How reproducible:
One out of three times

Steps to Reproduce:
1. Create a multisite configuration with two zones.
2. Run the s3-tests workload.
3. At some point the race condition is hit and radosgw crashes.

Comment 8 shilpa 2016-11-14 14:35:51 UTC
Tested and verified on ceph-10.2.3-13

Comment 10 errata-xmlrpc 2016-11-22 19:33:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html