Description of problem:
----------------------
In our OSP10d-deployed test bed [21 computes backed by 1043 Ceph OSDs], the compute node kernel.pid_max is set to 49152, but Ceph (especially with a large number of OSDs) can consume far more than this in thread count. The problem occurs because the number of threads per guest in this environment is much greater than the number of OSDs, and that, multiplied by the number of guests, far exceeds this limit. While most of the OSD process threads are idle until OSD repair/backfill is initiated, they count towards that limit, so Ceph typically recommends increasing kernel.pid_max to a much higher value.

Our environment immediately hit this limit (also seen in Ceph tracker http://tracker.ceph.com/issues/16118), so we increased pid_max to a much larger setting, which resolved guests being terminated during I/O to Ceph storage. Without doing so, not only was a subset of guests terminated during I/O, but a subset of those terminated guests would no longer start successfully in Nova. If this value is left unchanged at deployment, RHOSP 10 will fail to scale on RHCS 2.0.

Component Version-Release:
-------------------------
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)
kernel-3.10.0-510.el7.x86_64
ceph-*.x86_64                         1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed
openstack-aodh-*.noarch               3.0.0-0.20160921151816.bb5103e.el7ost
openstack-ceilometer-*.noarch         1:7.0.0-0.20160928024313.67bbd3f.el7ost
openstack-cinder.noarch               1:9.0.0-0.20160928223334.ab95181.el7ost
openstack-dashboard.noarch            1:10.0.0-0.20161002185148.3252153.1.el7ost
openstack-glance.noarch               1:13.0.0-0.20160928121721.4404ae6.el7ost
openstack-gnocchi-*.noarch            3.0.1-0.20160923180636.c6b2c51.el7ost
openstack-heat-*.noarch               1:7.0.0-0.20160926200847.dd707bc.el7ost
openstack-ironic-*.noarch             1:6.2.1-0.20160930163405.3f54fec.el7ost
openstack-keystone.noarch             1:10.0.0-0.20160928144040.6520523.el7ost
openstack-manila.noarch               1:3.0.0-0.20160916162617.8f2fa31.el7ost
openstack-mistral-api.noarch          3.0.0-0.20160929083341.c0a4501.el7ost
openstack-neutron.noarch              1:9.0.0-0.20160929051647.71f2d2b.el7ost
openstack-nova-api.noarch             1:14.0.0-0.20160929203854.59653c6.el7ost
openstack-puppet-modules.noarch       1:9.0.0-0.20160915155755.8c758d6.el7ost
openstack-sahara.noarch               1:5.0.0-0.20160926213141.cbd51fa.el7ost
openstack-selinux.noarch              0.7.9-1.el7ost     @rhos-10.0-puddle
openstack-swift-account.noarch        2.10.1-0.20160929005314.3349016.el7ost
openstack-swift-plugin-swift3.noarch  1.11.1-0.20160929001717.e7a2b88.el7ost
openstack-zaqar.noarch                1:3.0.0-0.20160921221617.3ef0881.el7ost
openvswitch.x86_64                    1:2.5.0-5.git20160628.el7fdb
puppet-ceph.noarch                    2.2.0-1.el7ost     @rhos-10.0-puddle
puppet-openstack_extras.noarch        9.4.0-1.el7ost     @rhos-10.0-puddle
puppet-openstacklib.noarch            9.4.0-0.20160929212001.0e58c86.el7ost
puppet-vswitch.noarch                 5.4.0-1.el7ost     @rhos-10.0-puddle
python-openstack-mistral.noarch       3.0.0-0.20160929083341.c0a4501.el7ost
python-openstackclient.noarch         3.2.0-0.20160914003636.8241f08.el7ost
python-openstacksdk.noarch            0.9.5-0.20160912180601.d7ee3ad.el7ost
python-openvswitch.noarch             1:2.5.0-5.git20160628.el7fdb
python-rados.x86_64                   1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed
python-rbd.x86_64                     1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed

How reproducible:
----------------
Consistent.

Steps to Reproduce:
------------------
1. Leave pid_max at its default value.
2. Start I/O on KVM guests to Ceph storage.
3. See a subset of the KVM guests terminated on the compute nodes and all I/O hang.

Actual results:
--------------
A subset of running guests are terminated (see errors in Additional info below); the remaining guest I/O hangs.

Expected results:
----------------
No guests are terminated and I/O completes.

Additional info:
---------------
Guest termination observed in messages:

Oct 27 14:29:08 overcloud-novacompute-0 journal: internal error: End of file from monitor
Oct 27 14:29:08 overcloud-novacompute-0 kvm: 24 guests now active
Oct 27 14:29:08 overcloud-novacompute-0 systemd-machined: Machine qemu-70-instance-000005d2 terminated.
Oct 27 14:29:08 overcloud-novacompute-0 journal: End of file while reading data: Input/output error

Pthread creation failure observed in the qemu instance logs:

Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f3a7249f700 time 2016-10-26 18:03:15.108590
common/Thread.cc: 160: FAILED assert(ret == 0)
 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee)
 1: (()+0x170635) [0x7f3a90958635]
 2: (()+0x193f3a) [0x7f3a9097bf3a]
 3: (()+0x333cd5) [0x7f3a90b1bcd5]
 4: (()+0x33438e) [0x7f3a90b1c38e]
 5: (()+0xd006e) [0x7f3a908b806e]
 6: (()+0xd0cd7) [0x7f3a908b8cd7]
 7: (()+0xd3e92) [0x7f3a908bbe92]
 8: (()+0xd41ad) [0x7f3a908bc1ad]
 9: (()+0xa829b) [0x7f3a9089029b]
 10: (librados::IoCtx::aio_operate(std::string const&, librados::AioCompletion*, librados::ObjectWriteOperation*, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >&)+0xe1) [0x7f3a9085cf71]
 11: (()+0x83729) [0x7f3a99fe7729]
 12: (()+0x83c4b) [0x7f3a99fe7c4b]
 13: (()+0x8654e) [0x7f3a99fea54e]
 14: (()+0x842dd) [0x7f3a99fe82dd]
 15: (()+0x743b9) [0x7f3a99fd83b9]
 16: (()+0x8b8aa) [0x7f3a99fef8aa]
 17: (()+0x9d7ed) [0x7f3a908857ed]
 18: (()+0x85cd9) [0x7f3a9086dcd9]
 19: (()+0x16f7e6) [0x7f3a909577e6]
 20: (()+0x7dc5) [0x7f3a8bb04dc5]
 21: (clone()+0x6d) [0x7f3a8b83373d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-10-26 18:03:15.389+0000: shutting down

We measured more than 2000 threads inside a single qemu-kvm process. Below we determine the number of threads per qemu-kvm process, most of which are related to librbd.

e.g., dump out all the threads in qemu-kvm processes and then count how many are in each process:

# ps -eLf | grep qemu-kvm | grep -v grep > /tmp/p
# ps awux | grep qemu-kvm | grep -v grep | awk '{ print $2 }' > /tmp/pids
# for p in `cat /tmp/pids` ; do echo -n "$p " ; awk "/ $p /" /tmp/p | wc -l ; done
9199 550
9496 562
141001 2029
141471 2027
142298 2030
143125 2027
143954 2028
144823 2027
145381 2027
146159 2028
146759 2028
147271 2030
147892 2027
148405 2027
148854 2027
149711 2029
149938 2027
150687 2027
151290 2029
151977 2028
152490 2027
153036 2028
153457 2027
154167 2027
154820 2029

and the total number of threads in just the qemu-kvm processes:

# for p in `cat /tmp/pids` ; do echo -n "$p " ; awk "/ $p /" /tmp/p | wc -l ; done | awk '{sum += $2}END{print sum}'
47752
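For what it's worth, the same numbers can be collected more directly by letting ps report each process's thread count (a minimal sketch; it assumes the processes are named qemu-kvm as above and uses standard procps options available on RHEL 7):

# per-process thread counts: PID, number of threads (NLWP), command name
ps -C qemu-kvm -o pid=,nlwp=,comm=

# total threads across all qemu-kvm processes
ps -C qemu-kvm -o nlwp= | awk '{sum += $1} END {print sum}'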
cc'ing Jeff Brown in RHS. This problem impacts OpenStack-Ceph scalability, and specifically the use of Ceph in the OpenStack scale lab.
I think it defaults to 32K for compatibility with 32-bit systems; for 64-bit systems it is limited to 4M instead, as it consumes more memory at higher values. Do you think defaulting to 1M would be reasonable?
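To double-check those numbers on a running node (a hedged sketch; 32768 is the stock kernel default, and on 64-bit kernels the compile-time ceiling is 4194304, so writes above that are expected to be rejected):

# show the current value (this test bed had it raised to 49152)
sysctl kernel.pid_max

# a write above the 64-bit ceiling should fail with "Invalid argument"
sysctl -w kernel.pid_max=8388608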
On OSP 10, the systemd unit file of an OSD already has "TasksMax=infinity", so this shouldn't be happening anymore. I don't think we need to change kernel.pid_max, as explained in the systemd documentation:

  Specify the maximum number of tasks that may be created in the unit. This ensures that the number of tasks accounted for the unit (see above) stays below a specific limit. This either takes an absolute number of tasks or a percentage value that is taken relative to the configured maximum number of tasks on the system. If assigned the special value "infinity", no tasks limit is applied. This controls the "pids.max" control group attribute.
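For reference, that per-unit limit can be inspected as below (a hedged sketch; the OSD instance id 0 and the cgroup path are placeholders, and the property may be absent on systemd builds that predate TasksMax). Note that TasksMax maps to the cgroup pids.max attribute, which is a per-unit limit, separate from the global kernel.pid_max:

# per-unit task limit for one OSD instance
systemctl show -p TasksMax ceph-osd@0.service

# the underlying cgroup attribute, if the pids controller is mounted
cat /sys/fs/cgroup/pids/system.slice/ceph-osd@0.service/pids.max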
Sebastien, to clarify, this problem is occurring on the compute nodes, not the Ceph nodes. The compute nodes are where we are running qemu-kvm processes that have librbd linked into them. The configuration is OSP 10 layered on an externally configured Ceph cluster, but I think this would be relevant even if OSPd were deploying the Ceph nodes.
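To confirm that these are indeed Ceph client threads, the per-thread names inside one of the large qemu-kvm processes can be listed (a minimal sketch; PID 141001 is just one of the processes from the listing above, and the Ceph client library threads are expected to dominate the output):

# count threads inside one qemu-kvm process, grouped by thread name
ps -L -p 141001 -o comm= | sort | uniq -c | sort -rn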
Right, sorry, I kind of missed the hypervisor part. So the next step is to increase kernel.pid_max to a very large value, something like 4194303? Thanks!
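In the meantime, the value can be raised by hand on the compute nodes (a minimal sketch; the sysctl.d file name is just a placeholder):

# apply immediately
sysctl -w kernel.pid_max=4194303

# persist across reboots
echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/99-pid-max.conf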
With the new builds including this change, the kernel.pid_max value will default to 1048576. It will be possible to customize this value using an environment file at deployment time. The environment file should look like the following:

parameter_defaults:
  KernelPidMax: 4194303
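For completeness, a hedged example of wiring that file into a deployment (the file path is a placeholder, and the deployment's usual environment files would still be passed alongside it):

cat > /home/stack/pid-max.yaml <<'EOF'
parameter_defaults:
  KernelPidMax: 4194303
EOF

openstack overcloud deploy --templates -e /home/stack/pid-max.yaml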
Thanks for fixing this.
Verified on openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch:

# cat /proc/sys/kernel/pid_max
1048576
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html