Red Hat Bugzilla – Bug 1461530
librados experiencing pthread_create failure with 2000 guests
Last modified: 2017-06-30 10:31:42 EDT
Please correct the component assignment; I'm not sure where this belongs.
Description of problem:
Description of problem:
In the scale lab, while attempting to run 3000 guests on 1000 Ceph OSDs in an HCI environment, we saw Nova guests dying without Nova knowing about it. Searching through KVM guest instance logs, I found that a librados pthread_create call was failing. The symptom is the same as bz 1389502, only now the kernel.pid_max param is elevated, so that isn't the problem. I believe this was caused by a process-count limit that was too low; the limit would not be hit on smaller configurations. It could easily be raised when OpenStack is installed by adding a line to /etc/security/limits.d/20-nproc.conf:
qemu soft nproc 32768
This high limit is required for Ceph in large-scale configs, as described here:
The workaround is to use the ulimit command or edit the above file, then stop and restart the guests.
In the long run (how long?), Ceph Luminous will support an "async messenger" that uses 1-2 orders of magnitude fewer threads, so this will stop being a problem then, but it could be a long time (1 year?) before that finds its way into the RHCS deployed by RHOSP, so we need this bz to deal with it until then.
Version-Release number of selected component (if applicable):
How reproducible:
The thread-creation limit is very reproducible using the test program included below. We already established in bz 1389502 that Ceph librados in each Nova guest consumes thousands of threads in this configuration.
Steps to Reproduce:
1. create a RHOSP cluster backed by 1020 Ceph OSDs
2. create 2000 guests each with a Cinder volume on Ceph storage
3. pdsh -S -R ssh -w ^guests.list dd if=/dev/zero of=/dev/vdb bs=1024k
Actual results:
Some of the guests fail to run the pdsh command with "broken pipe" errors, and some of the /var/log/libvirt/qemu/instance*log files contain:
Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f81c4787700 time 2017-
common/Thread.cc: 160: FAILED assert(ret == 0)
ceph version 10.2.5-37.el7cp (033f137cde8573cfc5a4662b4ed6a63b8a8d1464)
1: (()+0x175375) [0x7f81d674e375]
10: (clone()+0x6d) [0x7f81d160573d]
Expected results:
All guests should be able to write to their Cinder volumes without error.
A sosreport from one of the osdcompute nodes will be attached.
Here's a test program that creates a user-specified number of threads using pthread_create, so you can easily see whether we will get this error because of limits on the number of subprocesses.
Using it, I observed that the program fails when trying to create 4096 threads, which apparently can happen with librados, but succeeds at lower thread counts. The 4096 number comes from the limits.d file above.
[ben@bene-laptop openstack]$ ./thread-create 4096
thread count: 4096
Limit Soft Limit Hard Limit Units
Max processes 4096 4096 processes
fatal: Error creating thread
errno 11: Resource temporarily unavailable
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
(In reply to Ben England from comment #0)
> plz correct component assignment, I'm not sure where this goes.
> Description of problem:
> In scale lab while attempting to run 3000 guests on 1000 Ceph OSDs in HCI
> environment, we saw Nova guests dying without Nova knowing about it.
> Searching through KVM guest instance logs, I found that librados
> pthread_create call was failing. Symptom same as bz 1389502, only now the
> kernel.pid_max param is elevated so this isn't the problem. I think this
> was because of a limit for process count that was too low. This limit would
> not be hit on smaller configurations. This limit could easily be adjusted
> when OpenStack was installed by adding a line to
> qemu soft nproc 32768
This won't do anything. These limit files are only processed by PAM, and nothing runs PAM when launching QEMU processes. If the max process limit needs raising, then /etc/libvirt/qemu.conf needs changing.
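For reference, the corresponding knob (shown later in this bug to be 131072 on these nodes) lives in /etc/libvirt/qemu.conf and takes effect after restarting libvirtd:

```
# /etc/libvirt/qemu.conf
# Per-QEMU-process limit applied by libvirt when launching guests
# (PAM limits.d files are not consulted on this path).
max_processes = 131072
```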
> In the long run (how long?), Ceph Luminous will support "async messenger"
> that uses 1-2 orders of magnitude less threads, so this will stop being a
> problem then, but it could be a long time (1 year?) before this finds its
> way into RHCS being deployed by RHOSP, so we need this bz to deal with it
> until then.
Can you give a guideline on what the current thread usage by librados is? Does the number of threads scale with the number of volumes open, with the number of OSDs in use, or with a combination of both?
Changing the name of the bz because the original name was clearly inaccurate. You have a point, Daniel. With a single guest doing dd to its Cinder volume, the problem doesn't happen; if this were some sort of per-process limit, a single guest would hit it too. Also, /proc/pid/limits shows a limit of 128K threads, way higher than needed here - back in OSP 8 this was not true, BTW. And in /etc/libvirt/qemu.conf:
max_processes = 131072
But if you have 2048 guests, each with 2300 threads, that makes about 4.7 million threads, and the system-wide limit is:
# sysctl -a | grep pid_max
kernel.pid_max = 1048576
So there is our problem!
# time dd if=/dev/zero of=/dev/vdb bs=1024k
dd: error writing ‘/dev/vdb’: No space left on device
102401+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 109.268 s, 983 MB/s
At the same time, I logged the thread count of this qemu-kvm process; it never grew beyond ~2360.
[root@overcloud-osdcompute-5 ~]# while [ 1 ] ; do sleep 10 ; echo -n "`date` " ; ls /proc/214862/task | wc -l ; done
Fri Jun 16 15:29:46 UTC 2017 652
Fri Jun 16 15:29:57 UTC 2017 652
Fri Jun 16 15:30:07 UTC 2017 652
Fri Jun 16 15:30:18 UTC 2017 652
Fri Jun 16 15:30:28 UTC 2017 652
Fri Jun 16 15:30:39 UTC 2017 652
Fri Jun 16 15:30:49 UTC 2017 652
Fri Jun 16 15:31:00 UTC 2017 652
Fri Jun 16 15:31:10 UTC 2017 652
Fri Jun 16 15:31:21 UTC 2017 1902
Fri Jun 16 15:31:32 UTC 2017 2290
Fri Jun 16 15:31:44 UTC 2017 2336
Fri Jun 16 15:31:56 UTC 2017 2346
Fri Jun 16 15:32:07 UTC 2017 2350
Fri Jun 16 15:32:19 UTC 2017 2360
Fri Jun 16 15:32:31 UTC 2017 2360
Fri Jun 16 15:32:43 UTC 2017 2360
Here is the process:
[heat-admin@overcloud-osdcompute-5 ~]$ ps awux | grep qemu-kvm | grep -v grep
qemu 214862 0.8 0.1 2981032 510920 ? Sl 14:31 0:26 /usr/libexec/qemu-kvm -name guest=instance-00000003,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-instance-00000003/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 66cc06c7-61e4-42d5-b8f2-6859cb448b82 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=15.0.3-3.el7ost,serial=646d01b6-d863-4d73-90ce-7c535a7f2853,uuid=66cc06c7-61e4-42d5-b8f2-6859cb448b82,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-instance-00000003/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=JFsIIVGGxbfI0DRAJk7x3kCU4wmTBdNH0K/5dTWm6so=,keyid=masterKey0,iv=DucdDcKOW+CSwB1OWZdplA==,format=base64 -drive file=rbd:vms/66cc06c7-61e4-42d5-b8f2-6859cb448b82_disk:id=openstack:auth_supported=cephx\;none:mon_host=172.18.0.10\:6789\;172.18.0.13\:6789\;172.18.0.21\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:f0:1d:a0,bus=pci.0,addr=0x3 -add-fd set=2,fd=30 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device 
usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.0.37:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
[heat-admin@overcloud-osdcompute-5 ~]$ more /proc/214862/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 131072 131072 processes
Max open files 32769 32769 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 1030485 1030485 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
But the guests aren't all running on the same host. We can have at most 100 1-GB guests on a single host, and if each guest uses only 2300 threads, we can only consume 230,000 threads. So kernel.pid_max is not in the way :-( The limit in /etc/libvirt/qemu.conf is not system-wide, it is per process, right? If so, we have a mystery. I'll try the pthread_create program above to see how far we can go. Maybe there is some other knob that needs turning.
The max process ulimit is *per user* - in other words, the limit applies cumulatively to all QEMU processes on that host running under the "qemu" user account.
I modified the program above to allow multiple processes to create threads at the same time, much as the qemu-kvm and ceph-osd processes do. When I created a million threads, I found a pthread_create error in one of the qemu-kvm VM logs in /var/log/libvirt/qemu/instance*log, but only on the node where I created the million threads. So my best hypothesis at present is that the combination of all processes, including qemu-kvm and ceph-osd, consumed enough threads that we hit the kernel.pid_max wall, though I don't know exactly how at this point.
Because this is an HCI node, there are also 34 ceph-osd processes chewing up roughly 2300 threads apiece, or ~78,000 threads. But this doesn't get us to one million, the kernel.pid_max limit. Even with the theoretical maximum of 100 guests at 2300 threads/guest plus the 34 ceph-osd processes, we only get to 230,000 + 78,000 threads, roughly 1/3 of the way to the kernel.pid_max limit.
It is still pretty gross that we create this many threads per node at 1000-OSD scale; there may be subtle resource limitations involved that I'm not aware of. 256 GB / 300,000 threads gives only ~0.9 MB per thread, which means memory is in theory oversubscribed if all threads become active at once, which might occur if Ceph is in recovery mode (backfilling data).
Tim and I will try to re-run this test, if time permits, and log the thread counts for the whole system during Cinder volume creation and preallocation; we may not get it done before we lose the cluster.
We're going to close this for now to get it off our dashboard, please re-open and needinfo me or Dan when you have reproduced the issue and have the data.