Bug 1552202 - RGW multi-site segfault received when 'rgw_run_sync_thread = False' is set in ceph.conf
Summary: RGW multi-site segfault received when 'rgw_run_sync_thread = False' is set in ceph.conf
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW-Multisite
Version: 3.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: z4
Target Release: 3.0
Assignee: Matt Benjamin (redhat)
QA Contact: Vidushi Mishra
URL:
Whiteboard:
Depends On:
Blocks: 1553254
 
Reported: 2018-03-06 16:44 UTC by jquinn
Modified: 2021-09-09 13:24 UTC
CC List: 12 users

Fixed In Version: RHEL: ceph-12.2.4-15.el7cp Ubuntu: 12.2.4-19redhat1xenial
Doc Type: Bug Fix
Doc Text:
Previously, due to a programming error, Ceph RADOS Gateway (RGW) instances in zones configured for multi-site replication would crash if configured to disable sync ("rgw_run_sync_thread = false"). Therefore, multi-site replication environments could not start dedicated non-replication RGW instances. With this update, the "rgw_run_sync_thread" option can be used to configure RGW instances that will not participate in replication even if their zone is replicated. If this option is set for all active RGW instances in the zone, replication will not take place.
Clone Of:
Environment:
Last Closed: 2018-07-11 18:11:08 UTC
Embargoed:




Links
Ceph Project Bug Tracker 20448 (last updated 2018-03-07 10:08:08 UTC)
Github ceph ceph pull 20769 - rgw: fix crash with rgw_run_sync_thread false (closed; last updated 2021-02-16 02:56:44 UTC)
Github ceph ceph pull 20932 - luminous: rgw: fix crash with rgw_run_sync_thread false (closed; last updated 2021-02-16 02:56:44 UTC)
Red Hat Issue Tracker RHCEPH-1551 (last updated 2021-09-09 13:24:33 UTC)
Red Hat Product Errata RHSA-2018:2177 (last updated 2018-07-11 18:11:55 UTC)

Description jquinn 2018-03-06 16:44:06 UTC
Description of problem: The RGW process receives a segfault when the 'rgw_run_sync_thread = False' option is set in ceph.conf for the RGW instance. In this case, the customer is using this option in a containerized deployment of RGW.

This issue is tracked upstream at http://tracker.ceph.com/issues/20448, but has not yet been resolved.

The customer is looking to run four RGW instances in a multi-site configuration, with two of them dedicated to client requests and not handling replication. This flag appears to be the only way to support that split.
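For illustration only, a minimal ceph.conf sketch of such a split; the instance names and hosts below are hypothetical, not taken from the customer environment:

# Hypothetical replicating instance: sync threads enabled (the default).
[client.rgw.gw-sync-1]
host = gw-sync-1
rgw frontends = civetweb port=8080
rgw_run_sync_thread = true

# Hypothetical client-facing instance: serves requests, does not replicate.
[client.rgw.gw-client-1]
host = gw-client-1
rgw frontends = civetweb port=8080
rgw_run_sync_thread = false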


Version-Release number of selected component (if applicable): 12.2.1-40


How reproducible: every time


Steps to Reproduce:
1. Deploy an RGW instance (a multi-site config is not needed to reproduce).
2. Add rgw_run_sync_thread = False to ceph.conf for the RGW instance.
3. Restart the RGW service (see the shell sketch below).
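A shell sketch of steps 2-3; the unit name varies by deployment (this containerized setup uses ceph-radosgw.service, while package-based installs typically use a ceph-radosgw@<instance> template unit):

# Add the option under the instance's existing section in /etc/ceph/ceph.conf:
#   [client.rgw.vm250-102]
#   rgw_run_sync_thread = False
# Then restart the gateway; on affected builds it crashes on startup and
# systemd reports status=1/FAILURE, as in the journal output below.
systemctl restart ceph-radosgw.service
systemctl status ceph-radosgw.service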

Actual results:

[client.rgw.vm250-102.gsslab.pnq2.redhat.com]
debug_rgw = 20
osd_heartbeat_grace = 60 
host = vm250-102
keyring = /var/lib/ceph/radosgw/ceph-rgw.vm250-102/keyring
log file = /var/log/ceph/ceph-rgw-vm250-102.log
rgw frontends = civetweb port=10.74.250.102:8080 num_threads=100
rgw_run_sync_thread = False 


** Journalctl ** 

Mar 06 08:51:19 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service: main process exited, code=exited, status=1/FAILURE
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com docker[4023]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Unit ceph-radosgw.service entered failed state.
Mar 06 08:51:20 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service failed.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service holdoff time over, scheduling restart.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Starting Ceph RGW...
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd-journal[16792]: Suppressed 955 messages from /system.slice/docker.service
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.086802327-05:00" level=error msg="Handler for POST /v1.24/containers/ceph-rgw-vm250-102/stop?t=10 returned error: No such container: ceph-r
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.086844695-05:00" level=error msg="Handler for POST /v1.24/containers/ceph-rgw-vm250-102/stop returned error: No such container: ceph-rgw-vm
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd-journal[16792]: Suppressed 282 messages from /system.slice/system-ceph\x2dradosgw.slice
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com docker[4032]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.113109222-05:00" level=error msg="Handler for DELETE /v1.24/containers/ceph-rgw-vm250-102 returned error: No such container: ceph-rgw-vm250
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com dockerd-current[11747]: time="2018-03-06T08:51:30.113140646-05:00" level=error msg="Handler for DELETE /v1.24/containers/ceph-rgw-vm250-102 returned error: No such container: ceph-rgw-vm250
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com docker[4036]: Error response from daemon: No such container: ceph-rgw-vm250-102
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Started Ceph RGW.
Mar 06 08:51:30 vm250-102.gsslab.pnq2.redhat.com kernel: XFS (dm-3): Mounting V5 Filesystem



[root@vm250-102 ~]# systemctl status ceph-radosgw.service 
● ceph-radosgw.service - Ceph RGW
   Loaded: loaded (/etc/systemd/system/ceph-radosgw@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2018-03-06 08:52:40 EST; 3s ago
  Process: 5641 ExecStopPost=/usr/bin/docker stop ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
  Process: 5387 ExecStart=/usr/bin/docker run --rm --net=host --memory=1g --cpu-quota=100000 -v /var/lib/ceph:/var/lib/ceph -v /etc/ceph:/etc/ceph -e RGW_CIVETWEB_IP=10.74.250.102 -v /etc/localtime:/etc/localtime:ro -e CEPH_DAEMON=RGW -e CLUSTER=ceph -e RGW_CIVETWEB_PORT=8080 --name=ceph-rgw-vm250-102 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest (code=exited, status=1/FAILURE)
  Process: 5381 ExecStartPre=/usr/bin/docker rm ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
  Process: 5377 ExecStartPre=/usr/bin/docker stop ceph-rgw-vm250-102 (code=exited, status=1/FAILURE)
 Main PID: 5387 (code=exited, status=1/FAILURE)

Mar 06 08:52:40 vm250-102.gsslab.pnq2.redhat.com systemd[1]: Unit ceph-radosgw.service entered failed state.
Mar 06 08:52:40 vm250-102.gsslab.pnq2.redhat.com systemd[1]: ceph-radosgw.service failed.
[root@vm250-102 ~]# 



Expected results: The RGW process starts, and with rgw_run_sync_thread = False set, the instance does not perform replication.
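A quick verification sketch for a fixed build, using the same unit name as above: the service should stay active rather than cycling through auto-restart, and the admin sync-status command (which works regardless of rgw_run_sync_thread) can still be used to inspect the zone's replication state:

systemctl restart ceph-radosgw.service
systemctl is-active ceph-radosgw.service   # expect "active", not repeated auto-restarts
radosgw-admin sync status                  # inspect the zone's replication state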


Additional info:

Comment 3 Orit Wasserman 2018-03-07 10:07:32 UTC
upstream fix:
https://github.com/ceph/ceph/pull/20769

Comment 14 errata-xmlrpc 2018-07-11 18:11:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2177

