Bug 1552202

Summary: RGW multi-site segfault received when 'rgw_run_sync_thread = False' is set in ceph.conf
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: jquinn <jquinn>
Component: RGW-Multisite
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: Vidushi Mishra <vimishra>
Severity: high
Docs Contact:
Priority: low
Version: 3.0
CC: ceph-eng-bugs, ceph-qe-bugs, edonnell, hnallurv, mbenjamin, mhackett, molasaga, owasserm, tchandra, tserlin, vimishra, vumrao
Target Milestone: z4
Target Release: 3.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.4-15.el7cp; Ubuntu: 12.2.4-19redhat1xenial
Doc Type: Bug Fix
Doc Text:
Previously, due to a programming error, Ceph RADOS Gateway (RGW) instances in zones configured for multi-site replication would crash if configured to disable sync ("rgw_run_sync_thread = false"). Therefore, multi-site replication environments could not start dedicated non-replicating RGW instances. With this update, the "rgw_run_sync_thread" option can be used to configure RGW instances that will not participate in replication even if their zone is replicated. If this option is set for all active RGW instances in the zone, replication will not take place.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-11 18:11:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1553254

Description jquinn 2018-03-06 16:44:06 UTC
Description of problem: The RGW process receives a segfault when the 'rgw_run_sync_thread = False' option is set in ceph.conf for the RGW instance. In this case the customer is using this option in a containerized deployment of RGW.

This issue is tracked upstream at http://tracker.ceph.com/issues/20448, but has not yet been resolved.

The customer is looking to run 4 RGW instances in a multi-site config, with 2 of them dedicated to client requests and not handling replication. This flag appears to be the only way to meet that requirement; a sketch of the split follows.
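A hedged illustration of that split (the instance names gw1 and gw3 are made up, not from this report): the replicating gateways keep the default sync behavior, and the client-facing gateways opt out per instance in ceph.conf:

[client.rgw.gw1]
# replication gateway: sync threads stay enabled (the default)

[client.rgw.gw3]
# client-facing gateway: do not run sync threads
rgw_run_sync_thread = false

Per the Doc Text above, if every active gateway in the zone sets this option, replication for the zone stops entirely.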


Version-Release number of selected component (if applicable): 12.2.1-40


How reproducible: every time


Steps to Reproduce:
1. Deploy an RGW instance (a multi-site config is not needed to reproduce).
2. Add rgw_run_sync_thread = False to ceph.conf for the RGW instance.
3. Restart the RGW service (see the sketch below).
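A minimal reproduction sketch for a non-containerized instance (the instance name rgw.gw1 is illustrative; the reporter's environment is containerized and managed through ceph-radosgw.service, as shown in the logs below):

# /etc/ceph/ceph.conf
[client.rgw.gw1]
rgw_run_sync_thread = False

# systemctl restart ceph-radosgw@rgw.gw1
# journalctl -u ceph-radosgw@rgw.gw1 -e    # daemon exits with a segfault on startup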

Actual results:

[client.rgw.vm250-102.gsslab.pnq2.redhat.com]
debug_rgw = 20
osd_heartbeat_grace = 60 
host = vm250-102
keyring = /var/lib/ceph/radosgw/ceph-rgw.vm250-102/keyring
log file = /var/log/ceph/ceph-rgw-vm250-102.log
rgw frontends = civetweb port=10.74.250.102:8080 num_threads=100
rgw_run_sync_thread = False 


** Journalctl ** 

Mar 06 08:51:19 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service: main process exited, code=exited, status=1/FAILURE
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com docker[4023]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Unit ceph-radosgw.service entered failed state.
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service failed.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service holdoff time over, scheduling restart.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Starting Ceph RGW...
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd-journal[16792]: Suppressed 955 messages from /system.slice/docker.service
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.086802327-05:00" level=error msg="Handler for POST /v1.24/containers/ceph-rgw-vm250-102/stop?t=10 returned error: No such container: ceph-r
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.086844695-05:00" level=error msg="Handler for POST /v1.24/containers/ceph-rgw-vm250-102/stop returned error: No such container: ceph-rgw-vm
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd-journal[16792]: Suppressed 282 messages from /system.slice/system-ceph\x2dradosgw.slice
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com docker[4032]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.113109222-05:00" level=error msg="Handler for DELETE /v1.24/containers/ceph-rgw-vm250-102 returned error: No such container: ceph-rgw-vm250
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.113140646-05:00" level=error msg="Handler for DELETE /v1.24/containers/ceph-rgw-vm250-102 returned error: No such container: ceph-rgw-vm250
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com docker[4036]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Started Ceph RGW.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com kernel: XFS (dm-3): Mounting V5 Filesystem



[root@vm250-102 ~]# systemctl status ceph-radosgw.service 
● ceph-radosgw.service - Ceph RGW
   Loaded: loaded (/etc/systemd/system/ceph-radosgw@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2018-03-06 08:52:40 EST; 3s ago
  Process: 5641 ExecStopPost=/usr/bin/docker stop ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
  Process: 5387 ExecStart=/usr/bin/docker run --rm --net=host --memory=1g --cpu-quota=100000 -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -e RGW_CIVETWEB_IP=10.74.250.102 -v /etc/localtime:/etc/localtime:ro -e CEPH_DAEMON=RGW -e CLUSTER=ceph -e RGW_CIVETWEB_PORT=8080 --name=ceph-rgw-vm250-102 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest (code=exited, status=1/FAILURE)
  Process: 5381 ExecStartPre=/usr/bin/docker rm ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
  Process: 5377 ExecStartPre=/usr/bin/docker stop ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
 Main PID: 5387 (code=exited, status=1/FAILURE)

Mar 06 08:52:40 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Unit ceph-radosgw.service entered failed state.
Mar 06 08:52:40 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service failed.
[root@vm250-102 ~]# 



Expected results: The RGW process should start, and with rgw_run_sync_thread = False set, the instance should not perform replication. A verification sketch follows.
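A hedged verification sketch once the fixed build is in place (assumes a multi-site zone; the service name matches the status output above):

# systemctl status ceph-radosgw.service    # should stay active (running), with no restart loop
# radosgw-admin sync status                # zone-wide replication state, served by the sync-enabled gateways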


Additional info:

Comment 3 Orit Wasserman 2018-03-07 10:07:32 UTC
upstream fix:
https://github.com/ceph/ceph/pull/20769

Comment 14 errata-xmlrpc 2018-07-11 18:11:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2177