Description of problem:
----------------------
In our OSP10d-deployed test bed [21 computes backed by 1043 Ceph OSDs], the compute node kernel.pid_max is set to 49152, but Ceph (especially with a large number of OSDs) can consume far more than this in thread count. The problem occurs because the number of threads per guest in this environment is much greater than the number of OSDs, and that, multiplied by the number of guests, far exceeds this limit. While most of the OSD process threads are idle until OSD repair/backfill is initiated, they count towards that limit, so Ceph typically recommends increasing kernel.pid_max to a much higher value.

Our environment immediately hit this limit (also seen in Ceph tracker http://tracker.ceph.com/issues/16118), so we increased pid_max to a much larger setting, which resolved guests being terminated during I/O to Ceph storage. Without doing so, not only was a subset of guests terminated during I/O, but a subset of those terminated guests would no longer start successfully in Nova. If this value is left unchanged at deployment, RHOSP 10 will fail to scale on RHCS 2.0.

Component Version-Release:
-------------------------
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)
kernel-3.10.0-510.el7.x86_64
ceph-*.x86_64                         1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed
openstack-aodh-*.noarch               3.0.0-0.20160921151816.bb5103e.el7ost
openstack-ceilometer-*.noarch         1:7.0.0-0.20160928024313.67bbd3f.el7ost
openstack-cinder.noarch               1:9.0.0-0.20160928223334.ab95181.el7ost
openstack-dashboard.noarch            1:10.0.0-0.20161002185148.3252153.1.el7ost
openstack-glance.noarch               1:13.0.0-0.20160928121721.4404ae6.el7ost
openstack-gnocchi-*.noarch            3.0.1-0.20160923180636.c6b2c51.el7ost
openstack-heat-*.noarch               1:7.0.0-0.20160926200847.dd707bc.el7ost
openstack-ironic-*.noarch             1:6.2.1-0.20160930163405.3f54fec.el7ost
openstack-keystone.noarch             1:10.0.0-0.20160928144040.6520523.el7ost
openstack-manila.noarch               1:3.0.0-0.20160916162617.8f2fa31.el7ost
openstack-mistral-api.noarch          3.0.0-0.20160929083341.c0a4501.el7ost
openstack-neutron.noarch              1:9.0.0-0.20160929051647.71f2d2b.el7ost
openstack-nova-api.noarch             1:14.0.0-0.20160929203854.59653c6.el7ost
openstack-puppet-modules.noarch       1:9.0.0-0.20160915155755.8c758d6.el7ost
openstack-sahara.noarch               1:5.0.0-0.20160926213141.cbd51fa.el7ost
openstack-selinux.noarch              0.7.9-1.el7ost     @rhos-10.0-puddle
openstack-swift-account.noarch        2.10.1-0.20160929005314.3349016.el7ost
openstack-swift-plugin-swift3.noarch  1.11.1-0.20160929001717.e7a2b88.el7ost
openstack-zaqar.noarch                1:3.0.0-0.20160921221617.3ef0881.el7ost
openvswitch.x86_64                    1:2.5.0-5.git20160628.el7fdb
puppet-ceph.noarch                    2.2.0-1.el7ost     @rhos-10.0-puddle
puppet-openstack_extras.noarch        9.4.0-1.el7ost     @rhos-10.0-puddle
puppet-openstacklib.noarch            9.4.0-0.20160929212001.0e58c86.el7ost
puppet-vswitch.noarch                 5.4.0-1.el7ost     @rhos-10.0-puddle
python-openstack-mistral.noarch       3.0.0-0.20160929083341.c0a4501.el7ost
python-openstackclient.noarch         3.2.0-0.20160914003636.8241f08.el7ost
python-openstacksdk.noarch            0.9.5-0.20160912180601.d7ee3ad.el7ost
python-openvswitch.noarch             1:2.5.0-5.git20160628.el7fdb
python-rados.x86_64                   1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed
python-rbd.x86_64                     1:10.2.2-41.el7cp  @rhos-10.0-ceph-2.0-mon-signed

How reproducible:
----------------
Consistent.

Steps to Reproduce:
------------------
1. Leave pid_max at its default value.
2. Start I/O on KVM guests to Ceph storage.
3. See a subset of the KVM guests terminated on the compute nodes and all I/O hang.

Actual results:
--------------
A subset of running guests are terminated (see errors in Additional info below); the remaining guest I/O hangs.

Expected results:
----------------
No guests are terminated and I/O completes.

Additional info:
---------------
Guest termination observed in messages:

Oct 27 14:29:08 overcloud-novacompute-0 journal: internal error: End of file from monitor
Oct 27 14:29:08 overcloud-novacompute-0 kvm: 24 guests now active
Oct 27 14:29:08 overcloud-novacompute-0 systemd-machined: Machine qemu-70-instance-000005d2 terminated.
Oct 27 14:29:08 overcloud-novacompute-0 journal: End of file while reading data: Input/output error

Pthread creation failure observed in the qemu instance logs:

Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f3a7249f700 time 2016-10-26 18:03:15.108590
common/Thread.cc: 160: FAILED assert(ret == 0)
 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee)
 1: (()+0x170635) [0x7f3a90958635]
 2: (()+0x193f3a) [0x7f3a9097bf3a]
 3: (()+0x333cd5) [0x7f3a90b1bcd5]
 4: (()+0x33438e) [0x7f3a90b1c38e]
 5: (()+0xd006e) [0x7f3a908b806e]
 6: (()+0xd0cd7) [0x7f3a908b8cd7]
 7: (()+0xd3e92) [0x7f3a908bbe92]
 8: (()+0xd41ad) [0x7f3a908bc1ad]
 9: (()+0xa829b) [0x7f3a9089029b]
 10: (librados::IoCtx::aio_operate(std::string const&, librados::AioCompletion*, librados::ObjectWriteOperation*, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >&)+0xe1) [0x7f3a9085cf71]
 11: (()+0x83729) [0x7f3a99fe7729]
 12: (()+0x83c4b) [0x7f3a99fe7c4b]
 13: (()+0x8654e) [0x7f3a99fea54e]
 14: (()+0x842dd) [0x7f3a99fe82dd]
 15: (()+0x743b9) [0x7f3a99fd83b9]
 16: (()+0x8b8aa) [0x7f3a99fef8aa]
 17: (()+0x9d7ed) [0x7f3a908857ed]
 18: (()+0x85cd9) [0x7f3a9086dcd9]
 19: (()+0x16f7e6) [0x7f3a909577e6]
 20: (()+0x7dc5) [0x7f3a8bb04dc5]
 21: (clone()+0x6d) [0x7f3a8b83373d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-10-26 18:03:15.389+0000: shutting down

We measured more than 2000 threads inside a single qemu-kvm process. Below we determine the number of threads per qemu-kvm process, most of which are related to librbd.

e.g., dump out all the threads in qemu-kvm processes and then count how many are in each process:

# ps -eLf | grep qemu-kvm | grep -v grep > /tmp/p
# ps awux | grep qemu-kvm | grep -v grep | awk '{ print $2 }' > /tmp/pids
# for p in `cat /tmp/pids` ; do echo -n "$p " ; awk "/ $p /" /tmp/p | wc -l ; done
9199 550
9496 562
141001 2029
141471 2027
142298 2030
143125 2027
143954 2028
144823 2027
145381 2027
146159 2028
146759 2028
147271 2030
147892 2027
148405 2027
148854 2027
149711 2029
149938 2027
150687 2027
151290 2029
151977 2028
152490 2027
153036 2028
153457 2027
154167 2027
154820 2029

and the total number of threads in just the qemu-kvm processes:

# for p in `cat /tmp/pids` ; do echo -n "$p " ; awk "/ $p /" /tmp/p | wc -l ; done | awk '{sum += $2}END{print sum}'
47752
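For what it's worth, the same numbers can be collected more directly by letting ps report each process's thread count (a minimal sketch; it assumes the processes are named qemu-kvm as above and uses standard procps options available on RHEL 7):

# per-process thread counts: PID, number of threads (NLWP), command name
ps -C qemu-kvm -o pid=,nlwp=,comm=

# total threads across all qemu-kvm processes
ps -C qemu-kvm -o nlwp= | awk '{sum += $1} END {print sum}'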
cc'ing Jeff Brown in RHS. This problem impacts OpenStack-Ceph scalability, and specifically the use of Ceph in the OpenStack scale lab.
I think it defaults to 32K for compatibility with 32-bit systems; for 64-bit systems it is limited to 4M instead, as it consumes more memory at higher values. Do you think defaulting to 1M would be reasonable?
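To double-check those numbers on a running node (a hedged sketch; 32768 is the stock kernel default, and on 64-bit kernels the compile-time ceiling is 4194304, so writes above that are expected to be rejected):

# show the current value (this test bed had it raised to 49152)
sysctl kernel.pid_max

# a write above the 64-bit ceiling should fail with "Invalid argument"
sysctl -w kernel.pid_max=8388608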
On OSP 10, the systemd unit file of an OSD already has "TasksMax=infinity", so this shouldn't be happening anymore. I don't think we need to change kernel.pid_max, as explained in the systemd documentation:

  Specify the maximum number of tasks that may be created in the unit. This ensures that the number of tasks accounted for the unit (see above) stays below a specific limit. This either takes an absolute number of tasks or a percentage value that is taken relative to the configured maximum number of tasks on the system. If assigned the special value "infinity", no tasks limit is applied. This controls the "pids.max" control group attribute.
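For reference, that per-unit limit can be inspected as below (a hedged sketch; the OSD instance id 0 and the cgroup path are placeholders, and the property may be absent on systemd builds that predate TasksMax). Note that TasksMax maps to the cgroup pids.max attribute, which is a per-unit limit, separate from the global kernel.pid_max:

# per-unit task limit for one OSD instance
systemctl show -p TasksMax ceph-osd@0.service

# the underlying cgroup attribute, if the pids controller is mounted
cat /sys/fs/cgroup/pids/system.slice/ceph-osd@0.service/pids.max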
Sebastien, to clarify, this problem is occurring on the compute nodes, not the Ceph nodes. The compute nodes are where we are running qemu-kvm processes that have librbd linked into them. The configuration is OSP 10 layered on an externally configured Ceph cluster, but I think this would be relevant even if OSPd were deploying the Ceph nodes.
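To confirm that these are indeed Ceph client threads, the per-thread names inside one of the large qemu-kvm processes can be listed (a minimal sketch; PID 141001 is just one of the processes from the listing above, and the Ceph client library threads are expected to dominate the output):

# count threads inside one qemu-kvm process, grouped by thread name
ps -L -p 141001 -o comm= | sort | uniq -c | sort -rn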
Right, sorry, I kind of missed the hypervisor part. So the next step is to increase kernel.pid_max to a very large value, something like 4194303? Thanks!
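In the meantime, the value can be raised by hand on the compute nodes (a minimal sketch; the sysctl.d file name is just a placeholder):

# apply immediately
sysctl -w kernel.pid_max=4194303

# persist across reboots
echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/99-pid-max.conf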
With the new builds including this change, the kernel.pid_max value will default to 1048576. It will be possible to customize this value using an environment file at deployment time. The environment file should look like the following:

parameter_defaults:
  KernelPidMax: 4194303
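For completeness, a hedged example of wiring that file into a deployment (the file path is a placeholder, and the deployment's usual environment files would still be passed alongside it):

cat > /home/stack/pid-max.yaml <<'EOF'
parameter_defaults:
  KernelPidMax: 4194303
EOF

openstack overcloud deploy --templates -e /home/stack/pid-max.yaml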
Thanks for fixing this.
Verified on openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch:

# cat /proc/sys/kernel/pid_max
1048576
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html