Bug 1459891

Summary:	[Docs] Director should increase kernel.thread-max on ceph backed compute nodes
Product:	Red Hat OpenStack	Reporter:	Tomas Rusnak <trusnak>
Component:	documentation	Assignee:	RHOS Documentation Team <rhos-docs>
Status:	CLOSED EOL	QA Contact:	RHOS Documentation Team <rhos-docs>
Severity:	high	Docs Contact:
Priority:	low
Version:	10.0 (Newton)	CC:	bengland, cminkema, dwilson, jomurphy, jtaleric, kbader, mburns, mnelson, nlevinki, srevivo, twilkins
Target Milestone:	---	Keywords:	Documentation
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-07 10:40:50 UTC	Type:	Bug

Description Tomas Rusnak 2017-06-08 13:10:43 UTC

Description of problem:

In large-scale deployments, the kernel.pid_max parameter is set to a higher value than the default: 1048576. This was solved in BZ138950. The parameter determines how many numeric PIDs the kernel can assign. I don't think this solves the problem completely, because a different parameter limits the number of threads that can run concurrently on the system.
/proc/sys/kernel/threads-max is the system-wide maximum number of task_struct instances the kernel will allocate; task_struct is the data structure that describes each process/thread.
So if we only raise kernel.pid_max and leave kernel.threads-max at its default value of 46031, we are likely to hit this limit at large scale.

Version-Release number of selected component (if applicable):
Ocata, Newton, Pike

How reproducible:
I don't have a system at such a large scale available.

Steps to Reproduce:
1. # cat /proc/sys/kernel/pid_max
1048576
2. # cat /proc/sys/kernel/threads-max
46031
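
The two checks above can also be scripted. This is only a minimal sketch: the helper name read_sysctl and its root parameter are hypothetical, introduced here for illustration. It relies on the usual convention that a sysctl name such as kernel.threads-max maps to the file /proc/sys/kernel/threads-max.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl such as 'kernel.pid_max'.

    Hypothetical helper: returns None if the file is missing or unreadable
    (for example, on a non-Linux host).
    """
    # Dots in the sysctl name become directory separators under /proc/sys.
    path = Path(root).joinpath(*name.split("."))
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for name in ("kernel.pid_max", "kernel.threads-max"):
        print(name, "=", read_sysctl(name))
```

The values printed depend on the host; kernel.threads-max in particular is derived from the amount of RAM, so it will differ between machines.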

Actual results:


Expected results:


Additional info:

Comment 2 Ben England 2017-07-19 18:46:31 UTC

Tomas, 

Good points, it got me thinking harder about this.

We should document thread-related kernel parameter requirements for RHOSP 10 and 11, which depend on the older RHCS 2. 

In RHCS 3.0, Ceph is switching to a new "async messenger", replacing the "simple messenger" component that caused the massive consumption of threads per OSD and per instance (in librados).   So this should not be as much of an issue at that point.  RHCS 3.0 is supposed to be the release used with RHOSP 12 (Pike) and will definitely be used for RHOSP 13 (Queens).

However, for RHOSP 10 and 11, which integrate with RHCS 2, this will still be an issue.  I think kernel.threads-max does not have to be as big as kernel.pid_max, but with the simple messenger you still need on the order of 2 threads/OSD x (guests + OSDs).

I saw librados pthread_create failing to create a thread recently (RHOSP 11), even with the higher pid_max.  

https://bugzilla.redhat.com/show_bug.cgi?id=1461530#c8

I used this program to investigate what was going on - it just does pthread_create N times to see how many threads can be run at the same time.

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/openstack/thread-create.c

and looked at just kernel.threads-max, kernel.pid_max, and vm.max_map_count
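
The original thread-create.c is not reproduced in this report. As a rough illustration of the same idea (start N threads and keep them all alive at once), here is a minimal Python sketch. The function name count_startable_threads is hypothetical, and CPython threads carry more overhead than bare pthread_create calls, so absolute numbers will not match the C program.

```python
import threading


def count_startable_threads(target_count):
    """Try to keep target_count threads alive at once; return how many started."""
    release = threading.Event()
    threads = []
    try:
        for _ in range(target_count):
            # Each thread blocks on the event so that all of them coexist.
            t = threading.Thread(target=release.wait, daemon=True)
            t.start()  # raises RuntimeError if the OS refuses a new thread
            threads.append(t)
    except RuntimeError:
        pass  # hit a limit; report how far we got
    finally:
        release.set()  # let every thread finish
        for t in threads:
            t.join()
    return len(threads)


if __name__ == "__main__":
    requested = 1000
    print("requested:", requested, "started:", count_startable_threads(requested))
```

If the returned count is lower than the requested count, some limit was reached before all threads could be created; the sketch does not say which one.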

When I started with an untuned RHEL7.3 kernel on a 256-GB host:

[root@c04-h01-6048r ~]# sysctl -a | grep threads-max
kernel.threads-max = 2061221
[root@c04-h01-6048r ~]# sysctl -a | grep pid_max
kernel.pid_max = 57344
[root@c04-h01-6048r ~]# sysctl -a | grep vm.max_map_count   
vm.max_map_count = 65530

[root@c04-h01-6048r ~]# ./thread-create 200000
thread count: 200000
fatal: Error creating thread
errno 12 with thrd=56824: Cannot allocate memory

[root@c04-h01-6048r ~]# sysctl -w kernel.pid_max=1048576
kernel.pid_max = 1048576
[root@c04-h01-6048r ~]# sysctl -w vm.max_map_count=400000
vm.max_map_count = 400000

[root@c04-h01-6048r ~]# ./thread-create 200000
thread count: 200000
fatal: Error creating thread
errno 12 with thrd=199989: Cannot allocate memory

[root@c04-h01-6048r ~]# sysctl -w vm.max_map_count=500000
vm.max_map_count = 500000

[root@c04-h01-6048r ~]# ./thread-create 200000
thread count: 200000
SUCCESS

So vm.max_map_count limits how many threads you can create!  This was not obvious to me at first; I found it in a discussion of JVM thread creation.

https://stackoverflow.com/questions/5635362/max-thread-per-process-in-linux

Note that on a RHEL7.3 kernel with 256 GB RAM, the kernel.threads-max default appears to be:

[root@c04-h01-6048r ~]# sysctl -a | grep threads-max
kernel.threads-max = 2061221

So kernel.threads-max was likely not the problem in this case.

Comment 3 Lucy Bopf 2017-07-27 02:20:19 UTC

Clearing target release pending docs triage.

Comment 4 Ben England 2018-02-06 16:31:29 UTC

So my conclusion above was that vm.max_map_count had to be increased to >> 2x the total number of threads used by Ceph OSDs or RADOS clients, and before RHCS 3.0, this is quite high for a large cluster.  To calculate:

number of processes using librados (RBD clients, OSDs, RGWs, Cephfs clients) x number of OSDs x 2.  For example, in an RHHI cluster with 36 OSDs/host, and 50 guests with Cinder volumes, and 1000 OSDs in the cluster:

50 guests/host x 1000 OSD connections/guest x 2 threads/connection = 100000
36 OSDs/host x 1000 OSD connections/OSD x 2 threads/connection = 72000

The default value of vm.max_map_count on RHEL7.4 is 65530, so Ceph would not be able to connect up the cluster.
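
The arithmetic in this comment can be reproduced with a short sketch. The function name messenger_threads is hypothetical; the factor of 2 threads per connection is the simple-messenger figure quoted in this report, and the inputs are the example numbers above. The sketch only repeats the per-host estimate; it does not verify how those totals map onto kernel limits.

```python
THREADS_PER_CONNECTION = 2  # simple-messenger figure quoted in this report


def messenger_threads(librados_processes, cluster_osds,
                      threads_per_connection=THREADS_PER_CONNECTION):
    """Estimate threads on one host: processes x OSDs x threads per connection."""
    return librados_processes * cluster_osds * threads_per_connection


# Example from this comment: 50 guests and 36 OSDs per host, 1000 OSDs in the cluster.
guest_threads = messenger_threads(50, 1000)
osd_threads = messenger_threads(36, 1000)
total = guest_threads + osd_threads

DEFAULT_MAX_MAP_COUNT = 65530  # default reported above

print("guest threads:", guest_threads)  # 100000
print("OSD threads:", osd_threads)      # 72000
print("total threads:", total)          # 172000
print("exceeds default:", total > DEFAULT_MAX_MAP_COUNT)  # True
```

Changing the guest, OSD, or cluster-size inputs gives the equivalent estimate for other deployments.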

This problem does go away in RHCS 3.0 but there is still a huge problem with RHOSP 12 and RHOSP 11 support.  At a minimum this needs to be documented.  Can we get this fix into RHOSP12.z?

Comment 5 Ben England 2019-01-03 14:50:35 UTC

This bug is not present in RHCS 3.0 or RHOSP 13 so mark it fixed in next (long-term) release?