Bug 1987041 - RGW container fails to start when rgw thread pool size is close to 2048
Summary: RGW container fails to start when rgw thread pool size is close to 2048
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3
Assignee: Teoman ONAY
QA Contact: Sunil Kumar Nagaraju
Docs Contact: Ranjini M N
URL:
Whiteboard:
Depends On:
Blocks: 1760354 1987235 2031070
 
Reported: 2021-07-28 19:08 UTC by Teoman ONAY
Modified: 2022-05-05 07:53 UTC
CC List: 13 users

Fixed In Version: ceph-ansible-4.0.64-1.el8cp, ceph-ansible-4.0.64-1.el7cp
Doc Type: Bug Fix
Doc Text:
.The `--pids-limit` parameter is set to `-1` for `podman` and `0` for `docker` in the systemd unit files so that containers can start
Previously, the number of processes allowed to run inside a container, 2048 for `podman` and 4096 for `docker`, was not sufficient for containers that needed to start more processes than these limits. With this release, the limit on the maximum number of processes is removed by setting the `--pids-limit` parameter to `-1` for `podman` and to `0` for `docker` in the systemd unit files. As a result, the containers start even when they are configured to run more internal processes than the default limits would allow.
Clone Of:
Environment:
Last Closed: 2022-05-05 07:53:24 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 6777 0 None Merged [skip ci] container: add pids limit parameter 2021-11-22 08:28:23 UTC
Github ceph ceph-ansible pull 6789 0 None Merged container: add pids limit parameter (backport #6777) 2021-11-22 08:28:24 UTC
Red Hat Issue Tracker RHCEPH-638 0 None None None 2021-08-18 21:58:46 UTC
Red Hat Product Errata RHSA-2022:1716 0 None None None 2022-05-05 07:53:50 UTC

Description Teoman ONAY 2021-07-28 19:08:55 UTC
Description of problem:

RGW container fails to start with:

Jul 15 14:20:43 servera kernel: cgroup: fork rejected by pids controller in /machine.slice/libpod-54832992fcbbbf92b0d10d0491f7ff987728bec87c1c55b79cb3921c6f503f49.scope
Jul 15 14:20:43 servera conmon[34853]: terminate called after throwing an instance of 'std::system_error'
Jul 15 14:20:43 servera conmon[34853]:  what():  Resource temporarily unavailable
Jul 15 14:20:43 servera conmon[34853]: *** Caught signal (Aborted) **
Jul 15 14:20:43 servera conmon[34853]: in thread 7f5d605e1280 

The podman default pids-limit is set to 2048. 

$ grep . sys/fs/cgroup/pids/machine.slice/libpod-*/pids.max
sys/fs/cgroup/pids/machine.slice/libpod-9336707e04da464b9128b7c57a0ee9b70efc5acb5207ea03ab413583c2264283.scope/pids.max:2048
sys/fs/cgroup/pids/machine.slice/libpod-d0eb2257c2fb371e015af9b68c716ed280399d9d2e88925aad9323ac26f659f3.scope/pids.max:2048

While 2048 is more than sufficient when rgw thread pool size uses its default value of 512, raising rgw thread pool size to a value near the pids-limit leaves no room for the other processes that must spawn and run within the container, and the container crashes.
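The failure mode is simple arithmetic: the pool threads alone can consume the entire pids budget. A minimal sketch with the values from this report (the podman default of 2048 and a customized pool size of 2048; the headroom figure is illustrative, not a measured thread count):

```shell
# Illustrative headroom check: PIDs left for RGW's remaining threads
# (beast frontend workers, logging, timers, ...) once the thread pool
# is fully spawned inside the container.
pids_max=2048               # podman default pids-limit
rgw_thread_pool_size=2048   # customized value from this report
headroom=$(( pids_max - rgw_thread_pool_size ))
echo "headroom: ${headroom}"
# With zero headroom, the next fork is rejected by the pids cgroup
# controller, which matches the "fork rejected by pids controller"
# kernel message in the logs above.
```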

Version-Release number of selected component (if applicable):
ceph 4.2z2

How reproducible:

Steps to Reproduce:
1. Change the value of rgw thread pool size from 512 to 2048 and redeploy the RGW using ceph-ansible.
2. Start the RGW using systemctl start ceph-radosgw.rgwX

Actual results:

Jul 15 14:20:42 servera systemd[1]: Started Ceph RGW.
Jul 15 14:20:42 servera conmon[34853]: 2021-07-15 14:20:42.800 7f5d605e1280  0 deferred set uid:gid to 167:167 (ceph:ceph)
Jul 15 14:20:42 servera conmon[34853]: 2021-07-15 14:20:42.800 7f5d605e1280  0 ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable), process radosgw, pid 109
Jul 15 14:20:42 servera conmon[35177]: 2021-07-15 14:20:42  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
Jul 15 14:20:43 servera conmon[35177]: HEALTH_WARN 1 pools have too few placement groups; 20 pools have too many placement groups
Jul 15 14:20:43 servera conmon[34853]: 2021-07-15 14:20:43.245 7f5d605e1280  0 starting handler: beast
Jul 15 14:20:43 servera conmon[34853]: 2021-07-15 14:20:43.246 7f5d605e1280  0 set uid:gid to 167:167 (ceph:ceph)
Jul 15 14:20:43 servera kernel: cgroup: fork rejected by pids controller in /machine.slice/libpod-54832992fcbbbf92b0d10d0491f7ff987728bec87c1c55b79cb3921c6f503f49.scope
Jul 15 14:20:43 servera conmon[34853]: terminate called after throwing an instance of 'std::system_error'
Jul 15 14:20:43 servera conmon[34853]:  what():  Resource temporarily unavailable
Jul 15 14:20:43 servera conmon[34853]: *** Caught signal (Aborted) **
Jul 15 14:20:43 servera conmon[34853]: in thread 7f5d605e1280 thread_name:radosgw
Jul 15 14:20:43 servera conmon[34853]: ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)
Jul 15 14:20:43 servera conmon[34853]: 1: (()+0x12b20) [0x7f5d53285b20]
Jul 15 14:20:43 servera conmon[34853]: 2: (gsignal()+0x10f) [0x7f5d525b637f]
Jul 15 14:20:43 servera conmon[34853]: 3: (abort()+0x127) [0x7f5d525a0db5]
Jul 15 14:20:43 servera conmon[34853]: 4: (()+0x9009b) [0x7f5d52f6e09b]
Jul 15 14:20:43 servera conmon[34853]: 5: (()+0x9653c) [0x7f5d52f7453c]
Jul 15 14:20:43 servera conmon[34853]: 6: (()+0x96597) [0x7f5d52f74597]
Jul 15 14:20:43 servera conmon[34853]: 7: (()+0x967f8) [0x7f5d52f747f8]
Jul 15 14:20:43 servera conmon[34853]: 8: (()+0x9223b) [0x7f5d52f7023b]
Jul 15 14:20:43 servera conmon[34853]: 9: (()+0xc2e9d) [0x7f5d52fa0e9d]
Jul 15 14:20:43 servera conmon[34853]: 10: (RGWAsioFrontend::run()+0x1c5) [0x55a5bfa88b85]
Jul 15 14:20:43 servera conmon[34853]: 11: (main()+0x2851) [0x55a5bfa2d851]
Jul 15 14:20:43 servera conmon[34853]: 12: (__libc_start_main()+0xf3) [0x7f5d525a2493]
Jul 15 14:20:43 servera conmon[34853]: 13: (_start()+0x2e) [0x55a5bfa47cae]
Jul 15 14:20:43 servera conmon[34853]: 2021-07-15 14:20:43.303 7f5d605e1280 -1 *** Caught signal (Aborted) **
Jul 15 14:20:43 servera conmon[34853]: in thread 7f5d605e1280 thread_name:radosgw

Expected results:
Container should start.

Additional info:
ceph-ansible does not take into account that when rgw thread pool size is increased, the container's pids-limit should be adapted accordingly.

A possible solution is to modify ceph-rgw/templates/ceph-radosgw.service.j2 to add the parameter --pids-limit={{ radosgw_thread_pool_size + 2048 }} to the podman command line.
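As a sketch, the suggested template change would touch the ExecStart line of the generated systemd unit. The fragment below is illustrative only: the surrounding flags stand in for whatever the real template already passes, and only the --pids-limit addition reflects the proposed fix (the merged change ultimately used --pids-limit=-1 for podman instead of a computed value, per the Doc Text above):

```
# ceph-radosgw.service.j2 (sketch; existing flags elided)
ExecStart=/usr/bin/podman run --rm --net=host \
  --pids-limit={{ radosgw_thread_pool_size + 2048 }} \
  ... existing flags unchanged ... \
  {{ container_image }}
```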

Comment 15 errata-xmlrpc 2022-05-05 07:53:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 4.3 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1716

