Bug 1461530 - librados experiencing pthread_create failure with 2000 guests
Summary: librados experiencing pthread_create failure with 2000 guests
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Eoghan Glynn
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-14 17:12 UTC by Ben England
Modified: 2019-09-09 16:16 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-30 14:31:42 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Ben England 2017-06-14 17:12:42 UTC
Please correct the component assignment; I'm not sure where this goes.

Description of problem:

In the scale lab, while attempting to run 3000 guests on 1000 Ceph OSDs in an HCI environment, we saw Nova guests dying without Nova knowing about it.  Searching through the KVM guest instance logs, I found that a librados pthread_create call was failing.  The symptom is the same as bz 1389502, except that kernel.pid_max has already been raised here, so that is not the cause.  I think the failure was due to a process-count limit that was too low; the limit would not be hit on smaller configurations.  It could easily be adjusted at OpenStack installation time by adding a line to /etc/security/limits.d/20-nproc.conf:

qemu soft nproc 32768

This high limit is required for Ceph in large-scale configurations, as described here:

The workaround is to use the ulimit command or edit the file above, then stop and start the guests.

In the long run (how long?), Ceph Luminous will support an "async messenger" that uses 1-2 orders of magnitude fewer threads, so this will eventually stop being a problem.  However, it could be a long time (a year?) before that finds its way into the RHCS release deployed by RHOSP, so we need this bz to deal with it until then.

Version-Release number of selected component (if applicable):


How reproducible:

The thread-creation limit is very reproducible using a test program that I wrote (see below).  We already established in bz 1389502 that Ceph librados in each Nova guest consumes thousands of threads in this configuration.

Steps to Reproduce:
1. create a RHOSP cluster backed by 1020 Ceph OSDs
2. create 2000 guests each with a Cinder volume on Ceph storage
3. pdsh -S -R ssh -w ^guests.list dd if=/dev/zero of=/dev/vdb bs=1024k


Actual results:

Some of the guests fail to run the pdsh command with "broken pipe" errors, and in some of the /var/log/libvirt/qemu/instance*log files we see:

Thread::try_create(): pthread_create failed with error 11
common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f81c4787700 time 2017-06-08 15:17:22.579034
common/Thread.cc: 160: FAILED assert(ret == 0)
 ceph version 10.2.5-37.el7cp (033f137cde8573cfc5a4662b4ed6a63b8a8d1464)
 1: (()+0x175375) [0x7f81d674e375]
...
 10: (clone()+0x6d) [0x7f81d160573d]


Expected results:

All guests should be able to write to all of their Cinder volumes without error.

Additional info:

A sosreport for one of the osdcompute nodes will be attached.

Here is a test program that creates a user-specified number of threads using pthread_create, so you can easily check whether this error will be hit because of limits on the number of processes/threads:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/openstack/thread-create.c

Using this, I observed that the program fails when asked to create 4096 threads (a count that librados apparently can reach), but works fine for lower thread counts.  The 4096 figure comes from the limits.d file mentioned above.  A rough sketch of such a probe follows the sample output below.

[ben@bene-laptop openstack]$ ./thread-create 4096
thread count: 4096
cat /proc/12841/limits
Limit                     Soft Limit           Hard Limit           Units     
...
Max processes             4096                 4096                 processes 
...
x: 0
fatal: Error creating thread
errno 11: Resource temporarily unavailable
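
For reference, a minimal sketch of what such a probe might look like (illustrative only; the real thread-create.c is the one at the URL above, and the names and structure here are approximations).  It creates N threads that all stay alive until creation completes, so a pthread_create failure with EAGAIN reveals the effective task limit:

/* thread-limit probe -- illustrative sketch only, not the original thread-create.c.
 * Creates argv[1] threads that all stay alive until creation completes, so a
 * pthread_create failure (EAGAIN, errno 11) reveals the effective task limit. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int finished = 0;

static void *wait_for_exit(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!finished)                  /* keep the thread alive so it counts */
        pthread_cond_wait(&done, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    long count = (argc > 1) ? atol(argv[1]) : 4096;
    pthread_t *tids = calloc(count, sizeof(*tids));
    long created;

    if (!tids) {
        perror("calloc");
        return 1;
    }
    printf("thread count: %ld\n", count);
    for (created = 0; created < count; created++) {
        int ret = pthread_create(&tids[created], NULL, wait_for_exit, NULL);
        if (ret != 0) {                /* ret == EAGAIN (11) when the limit is hit */
            fprintf(stderr, "fatal: Error creating thread %ld\n", created);
            fprintf(stderr, "errno %d: %s\n", ret, strerror(ret));
            break;
        }
    }

    pthread_mutex_lock(&lock);         /* release the threads and clean up */
    finished = 1;
    pthread_cond_broadcast(&done);
    pthread_mutex_unlock(&lock);
    for (long i = 0; i < created; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return 0;
}

Build with something like "gcc -pthread -o thread-create-sketch thread-create-sketch.c" and pass the desired thread count as the first argument.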

Comment 1 Red Hat Bugzilla Rules Engine 2017-06-14 17:12:53 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 3 Red Hat Bugzilla Rules Engine 2017-06-14 17:22:44 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 4 Daniel Berrangé 2017-06-15 08:45:15 UTC
(In reply to Ben England from comment #0)
> Please correct the component assignment; I'm not sure where this goes.
> 
> Description of problem:
> 
> In the scale lab, while attempting to run 3000 guests on 1000 Ceph OSDs in
> an HCI environment, we saw Nova guests dying without Nova knowing about it.
> Searching through the KVM guest instance logs, I found that a librados
> pthread_create call was failing.  The symptom is the same as bz 1389502,
> except that kernel.pid_max has already been raised here, so that is not the
> cause.  I think the failure was due to a process-count limit that was too
> low; the limit would not be hit on smaller configurations.  It could easily
> be adjusted at OpenStack installation time by adding a line to
> /etc/security/limits.d/20-nproc.conf:
> 
> qemu soft nproc 32768

This won't do anything. These limit files are only processed by PAM, and nothing runs PAM when launching QEMU processes. If the max process limit needs raising, then /etc/libvirt/qemu.conf needs changing.

> In the long run (how long?), Ceph Luminous will support an "async messenger"
> that uses 1-2 orders of magnitude fewer threads, so this will eventually
> stop being a problem.  However, it could be a long time (a year?) before
> that finds its way into the RHCS release deployed by RHOSP, so we need this
> bz to deal with it until then.

Can you give a guideline on what the current thread usage by librados is?  Does the number of threads scale with the number of volumes open, with the number of OSDs in use, or with a combination of both?

Comment 5 Ben England 2017-06-16 15:46:51 UTC
Changing the name of the bz because the original name was clearly inaccurate.  You have a point, Daniel.  With a single guest doing dd to its Cinder volume the problem doesn't happen; if this were some sort of per-process limit, a single guest would hit it too.  Also, /proc/<pid>/limits shows a "Max processes" limit of 128K, way higher than needed here (back in OSP 8 this was not true, BTW).  And in /etc/libvirt/qemu.conf:

max_processes = 131072

But if you have 2048 guests each with ~2300 threads, that makes roughly 4.7 million threads, and the system-wide limit is:

# sysctl -a | grep pid_max
kernel.pid_max = 1048576

So there is our problem!

--------------------
root@benvm:~
# time dd if=/dev/zero of=/dev/vdb bs=1024k
dd: error writing ‘/dev/vdb’: No space left on device
102401+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 109.268 s, 983 MB/s

real	1m49.273s
user	0m0.055s
sys	1m3.618s
-----------------------

At the same time I logged the thread count in this qemu-kvm process; it never grew beyond ~2300.

------------------------
[root@overcloud-osdcompute-5 ~]# while [ 1 ] ; do sleep 10 ; echo -n "`date` "  ; ls /proc/214862/task | wc -l ; done
Fri Jun 16 15:29:46 UTC 2017 652
Fri Jun 16 15:29:57 UTC 2017 652
Fri Jun 16 15:30:07 UTC 2017 652
Fri Jun 16 15:30:18 UTC 2017 652
Fri Jun 16 15:30:28 UTC 2017 652
Fri Jun 16 15:30:39 UTC 2017 652
Fri Jun 16 15:30:49 UTC 2017 652
Fri Jun 16 15:31:00 UTC 2017 652
Fri Jun 16 15:31:10 UTC 2017 652
Fri Jun 16 15:31:21 UTC 2017 1902
Fri Jun 16 15:31:32 UTC 2017 2290
Fri Jun 16 15:31:44 UTC 2017 2336
Fri Jun 16 15:31:56 UTC 2017 2346
Fri Jun 16 15:32:07 UTC 2017 2350
Fri Jun 16 15:32:19 UTC 2017 2360
Fri Jun 16 15:32:31 UTC 2017 2360
Fri Jun 16 15:32:43 UTC 2017 2360
-------------------

Here is the process:

[heat-admin@overcloud-osdcompute-5 ~]$ ps awux | grep qemu-kvm | grep -v grep
qemu      214862  0.8  0.1 2981032 510920 ?      Sl   14:31   0:26 /usr/libexec/qemu-kvm -name guest=instance-00000003,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-instance-00000003/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 66cc06c7-61e4-42d5-b8f2-6859cb448b82 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=15.0.3-3.el7ost,serial=646d01b6-d863-4d73-90ce-7c535a7f2853,uuid=66cc06c7-61e4-42d5-b8f2-6859cb448b82,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-instance-00000003/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=JFsIIVGGxbfI0DRAJk7x3kCU4wmTBdNH0K/5dTWm6so=,keyid=masterKey0,iv=DucdDcKOW+CSwB1OWZdplA==,format=base64 -drive file=rbd:vms/66cc06c7-61e4-42d5-b8f2-6859cb448b82_disk:id=openstack:auth_supported=cephx\;none:mon_host=172.18.0.10\:6789\;172.18.0.13\:6789\;172.18.0.21\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:f0:1d:a0,bus=pci.0,addr=0x3 -add-fd set=2,fd=30 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.16.0.37:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

[heat-admin@overcloud-osdcompute-5 ~]$ more /proc/214862/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             131072               131072               processes 
Max open files            32769                32769                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1030485              1030485              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

Comment 6 Ben England 2017-06-16 17:15:03 UTC
But the guests aren't all running on the same host.  We can have at most 100 1-GB guests on a single host, and if each guest is only using ~2300 threads we can consume at most ~230,000 threads.  So kernel.pid_max is not in the way :-(  The limit in /etc/libvirt/qemu.conf is not system-wide, it is per process, right?  If so, we have a mystery.  I'll try to use the pthread_create program above to see how far we can go; maybe there is some other knob that needs turning.

Comment 7 Daniel Berrangé 2017-06-16 17:17:23 UTC
The max process ulimit is *per user*; in other words, the limit applies cumulatively to all QEMU processes on that host running under the "qemu" user account.
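
For illustration, a minimal C sketch that reads this limit from within a process via getrlimit(RLIMIT_NPROC); the value is the same one shown in the "Max processes" row of /proc/<pid>/limits, but the kernel enforces it against all tasks owned by the real UID rather than per process:

/* Minimal sketch: print the RLIMIT_NPROC soft/hard limits for this process.
 * The kernel checks this limit against the total number of processes/threads
 * owned by the real UID (e.g. "qemu"), not against each process separately. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("Max processes: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}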

Comment 8 Ben England 2017-06-16 21:10:46 UTC
I modified this program:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/openstack/thread-create.c 

to allow multiple processes to create threads at the same time, much like the qemu-kvm and ceph-osd processes do (a rough sketch of the multi-process variant follows).  When I created a million threads, I found a pthread_create error in one of the qemu-kvm VM logs in /var/log/libvirt/qemu/instance*log, but only on the node where I had created the million threads.  So my best hypothesis at present is that the combination of all processes, including qemu-kvm and ceph-osd, consumed enough threads that we hit the kernel.pid_max wall, though I don't know exactly how at this point.
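
A rough sketch of what the multi-process variant might look like (illustrative only; the modified program itself is the one at the URL above): it forks a number of worker processes, each of which creates its own batch of threads, so the host-wide task count can be pushed toward kernel.pid_max much as a set of qemu-kvm and ceph-osd processes would:

/* Illustrative sketch only (the modified thread-create.c lives at the URL above):
 * fork NPROCS workers, each creating NTHREADS threads, so the combined task
 * count on the host approaches limits such as kernel.pid_max or the per-user
 * nproc limit. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
    (void)arg;
    sleep(60);                          /* stay alive so the thread keeps counting */
    return NULL;
}

static int spawn_threads(long nthreads)
{
    for (long i = 0; i < nthreads; i++) {
        pthread_t tid;
        int ret = pthread_create(&tid, NULL, idle_thread, NULL);
        if (ret != 0) {                 /* EAGAIN here means a task limit was hit */
            fprintf(stderr, "pid %d: pthread_create failed at thread %ld: %s\n",
                    (int)getpid(), i, strerror(ret));
            return 1;
        }
        pthread_detach(tid);
    }
    return 0;
}

int main(int argc, char **argv)
{
    long nprocs   = (argc > 1) ? atol(argv[1]) : 10;
    long nthreads = (argc > 2) ? atol(argv[2]) : 1000;

    for (long p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: create threads, then hold them */
            int rc = spawn_threads(nthreads);
            sleep(60);                  /* keep all workers' threads alive at once */
            _exit(rc);
        }
        if (pid < 0) {                  /* fork itself can also hit the limits */
            perror("fork");
            break;
        }
    }
    while (wait(NULL) > 0)              /* parent: reap all workers */
        ;
    return 0;
}

For example, "./fork-threads 100 2300" (hypothetical name) roughly models 100 guests holding ~2300 threads each under one user account.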

Because this is an HCI node, there are also 34 ceph-osd processes chewing up ~2300 threads apiece, or roughly 78,000 threads.  But this doesn't get us to one million, the kernel.pid_max limit.  Even with the theoretical maximum of 100 guests at 2300 threads/guest plus the 34 ceph-osd processes, we only get to 230,000 + 78,000 ≈ 308,000 threads, less than a third of the way to the kernel.pid_max limit.

It is still pretty gross that we create this many threads per node at 1000-OSD scale; there may be subtle resource limitations involved that I'm not aware of.  256 GB divided by ~300,000 threads gives you only ~0.9 MB per thread, which means memory is in theory oversubscribed if all threads become active at once, which might occur if Ceph is in recovery mode (backfilling data).

Tim and I are trying to re-run this test, if time permits, and log the thread counts for the whole system during Cinder volume creation and preallocation; we may not get it done before we lose the cluster.

Comment 10 melanie witt 2017-06-30 14:31:42 UTC
We're going to close this for now to get it off our dashboard; please re-open and needinfo me or Dan when you have reproduced the issue and have the data.

