Red Hat Bugzilla – Bug 104525
LTC4036-Large number of threads locks up the system
Last modified: 2007-11-30 17:06:58 EST
The following has been reported by IBM LTC:
Large number of threads locks up the system
Hardware Environment: PIII 500mhz, 128mb ram, 1gb swap
Software Environment: RHAS 3.0 Beta1
Steps to Reproduce:
1. Create program to create threads
2. run program to create 10000 threads
Actual Results: fail to create threads as system hangs
Expected Results: create 10000 threads or error out gracefully
NPTL/LT - large # threads hangs OS, still investigating, but seen it hang at
around 3k and 7k NPTL with GUI (GNOME) up, 9673 NPTL with GUI inactive, 5k LT
with GUI up, 7770 LT with GUI inactive. Entire machine locks up and requires a
hard reboot. All my program does is to loop doing pthread_create (takes as
arguments the # of threads and the stack size of each thread); each thread
calls sleep() for a long period of time.
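For readers without the attachment, the loop described above can be approximated as follows. This is a minimal Python sketch, not the reporter's program (which is written against pthread_create directly); the name spawn_sleepers and the use of an Event in place of a long sleep() are assumptions, made so the threads can be released cleanly afterwards.

```python
import threading


def spawn_sleepers(count, stack_kb, release):
    """Start up to `count` threads that block until `release` is set.

    Returns the list of threads that were actually started.
    """
    # stack_size() applies to threads created after the call and
    # returns the previous setting, which is restored on exit.
    previous = threading.stack_size(stack_kb * 1024)
    started = []
    try:
        for i in range(count):
            t = threading.Thread(target=release.wait, daemon=True)
            try:
                t.start()
            except RuntimeError as exc:
                # Raised when the OS refuses to create another thread.
                print(f"thread {i} could not be created: {exc}")
                break
            started.append(t)
    finally:
        threading.stack_size(previous)
    return started
```

Calling spawn_sleepers(10000, 96, threading.Event()) mirrors the reported run of 10000 threads with 96 KB stacks; on a healthy system the loop should either finish or stop at the first creation failure rather than hang.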
Program available upon request.
Ken - sounds like you ran out of pinned memory. Did you try increasing the
amount of RAM on your system to see if that makes a difference?
Nope, did not try to add memory (budget). I don't have a big iron machine to
test this on either, just don't have the systems. I did try changing the stack
size - the first series of tests were at 96kb stack sizes, at 256kb, LT failed
at 7565, which is pretty close to 7770 in # of threads. So, 7770 @96kb =
728mb, 7565 @ 256kb = 1.891 GB. Doesn't quite hash up if it's a pinned memory
thing, unless it is something happening along the way internally that I'm not
seeing (such as actually mapping in memory). If you have a big iron RHAS 3.0
Beta 1 system you can test this on, that would be great. Either way, I still
think it's bad that a user space program can hang the entire machine (hard boot
required) - very bad.
FYI: NPTL @96 was 9673 threads, @256 was 8035 threads.
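The totals quoted above can be checked with a few lines. This is only a sketch: reserved_stack_mb is a hypothetical helper, and it computes the address space reserved for thread stacks, not resident or pinned memory, since stack pages are normally only backed once they are touched.

```python
def reserved_stack_mb(threads, stack_kb):
    """Address space reserved for thread stacks, in MB (1 MB = 1024 KB)."""
    return threads * stack_kb / 1024


# Thread counts and stack sizes reported in this bug.
runs = [
    ("LT", 7770, 96),
    ("LT", 7565, 256),
    ("NPTL", 9673, 96),
    ("NPTL", 8035, 256),
]

for library, threads, stack_kb in runs:
    total = reserved_stack_mb(threads, stack_kb)
    print(f"{library}: {threads} threads @ {stack_kb} KB = {total:.1f} MB")
```

The failure point moves very little when the stack size nearly triples, which is the reporter's reason for doubting that stack memory alone explains the hang.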
Ken - We will test this on our large 8-way SMP system with RHEL3 beta1 later
today. Still waiting for labserve to get us a removable scsi drive for the 8-way.
Installed RHEL AS 3.0 beta on milicent.ltc.austin.ibm.com.
You can telnet in remotely as root, our normal setup for password, etc. The
kernel source is on there, but I can't find glibc src rpm anywhere on the 3 CDs
they shipped. I will need to talk to Glen Johnson about that.
Mike Lepore says he needs the system back around 6 PM, so I will shut it
down at 6 and let Mike use it overnight. You can have it back by morning.
Went to talk to Glen Johnson. He told me Beta 2 is coming out today.
Scott Russell should have it shadowed in the next day or so.
Y'all may want to use Beta 2 to look at this rather than Beta 1?
I have Beta 2 CDs downloaded and cut. Let me know if you want me to put that
Also, Beta 1 has source CDs. They are just not on our internal ftp site. Glen
is going to ask Scott about that.
Salina
Salina - thanks for your help. Yes, please install Beta2 on the machine.
Ok, will install Beta 2 on it tomorrow - takes about 2 hours to install it.
Mike is going to use the machine now for lkcd testing.
Beta 2 installed on milicent.
Have fun debugging.
Mike - Please handle this bug for me. I am overwhelmed by
I realize your test case is simple, but could you attach the source code and
build instructions just to keep things as consistent as possible? Thanks.
Khoa /
I also downloaded the source iso files for beta 2 now.
They are on /iso on milicent.
I installed the glibc source rpm and already did
so you can see the source code on /usr/src/redhat/BUILD/glibc-2.3.2-200308141835
I will be out tomorrow morning, but I left the 3 install CDs in my big black CD
case in the lab in case you need it. They are marked Taroon Beta 2 ( red ink ).
Created an attachment (id=1485)
Tar file of my simple thread program
I'm on vac, but I think this is a fairly recent version of my threads program.
I think this is not specifically related to RHEL3, but it is an NPTL/threading
issue, which is why my team is investigating.
Mike,
I know you are on vacation today, so I did some leg work for you.
I downloaded Ken's tar file and untarred it in /root/threads on milicent for you.
I briefly looked at what is there.
compit.l is the linux way to re-build executable.
I changed it to add
set -x ( traces commands )
added --save-temps to gcc option - so you can see the macro expansion
in filename.i files
We do not have NGPT installed, so you need not worry about that
To use Linuxthreads, before you execute LinuxServ
do export LD_ASSUME_KERNEL=2.2.5 and you will see the /lib/libpthread.so.0
if you do
you will see /lib/tls/libpthread.so.0 being loaded and thus using NPTL
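Another way to confirm which threading library a process ends up with is to ask glibc directly. The sketch below is an assumption on my part, not what was run on milicent: it relies on the glibc-specific CS_GNU_LIBPTHREAD_VERSION configuration string, and threading_library is a hypothetical helper name.

```python
import os


def threading_library():
    """Return glibc's description of its pthread implementation.

    Returns None when the C library does not expose this value.
    """
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError):
        # ValueError: the name is unknown on this platform.
        return None


print(threading_library())
```

On glibc systems the returned string names the implementation (NPTL or linuxthreads) together with a version, so running it with and without LD_ASSUME_KERNEL set should show whether the switch took effect.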
Running on milicent ( Beta 2 installed ).
I am getting up to 11991@256K or 2029@1024K
I get ENOMEM ( rc = 12 ) from pthread_create but system still runs fine
Typed too fast: it is 3029@1024K that we got running on milicent.
Ken - please try your testcase on RHEL3 beta2. Salina cannot reproduce
this bug on beta2 - it returns an ENOMEM error, which is expected. Thanks.
I have this and another bug to test with the latest stuff, so when I'm back from
Vac Tuesday I will download the latest stuff and try again. Thanks y'all!
(FYI: We are on vac in Cancun on our 6th yr Anniversary :)
I'll let y'all know as soon as I have results. (Back from vac Tuesday)
Please let us know if you still have problems with Beta 2. Thanks
Salina
We should mention milicent is a Netfinity 8500R ( 8681-8RY ), 8 CPU, 16G memory.
I installed the machine with 2G swap space.
I'm actually now on Beta 2, but having
problems getting the latest "patches"
due to the issue with SSL - I manually downloaded up2date and up2date-gnome,
but still cannot connect to RHN so I'm having to download packages
individually. Once I'm at the latest of Beta 2 I will retry the threading test.
Well, I updated to the latest packages from RHN and I can still duplicate the
issue on my system :( Got to 6606 threads @ 256kb using NPTL and everything
locked up on my machine. I know I'm pushing my little single cpu 500mhz, but I
just don't like that my little user space program can lock up the entire
system - I'm in runlevel 3 and I cannot ctrl-c, cannot change virtual
terminals, nada, gotta do a hard reboot. I'm on the 414 level of the latest
kernel with full patches. Ideas?
I was at run level 5 - X is up and everything but I have a "big iron".
Khoa, do you want us to try to dup the problem now with one of the netvistas or
do you just want to report this ? NPTL is RedHat's own product but this
sounds like a kernel problem if entire machine locks up.
Let me know what you want to do. I am working on the p-series machine, or
maybe you want to investigate this yourself since you have more VMM expertise.
Mike Lepore is going to try to re-create and look into this with a Netvista /
Netfinity box we have in the lab.
Just as an FYI - this happens with
LD_ASSUME_KERNEL=2.4.18 and hence using
LinuxThreads as well (on this run got 5847 threads @ 256kb).
We tried to reproduce the problem on the 8500R 8-way, running Beta2, to no
avail. We tried both the smp and up kernels. On both kernels we tried limiting
the memory by passing mem=128M to the kernel and also tried with full memory
(the smp kernel showed the full 16G system memory, and the up kernel showed
4G). In all cases we tried to reach 20000 threads at 256K and 1024K stack
sizes. When threads-max is reached, error code 11 occurs. When memory is
exhausted, error code 12 occurs. In cases where threads-max is reached first, I
tried bumping the value up and re-running the test. This would result in error
code 12. When error 11 occurs, anything that requires a fork will fail
(starting a new session, issuing a command like ls, etc.). In the case where
error 12 occurs, the system simply slows down a little. In no case did the
system actually hang - we were able to switch to a new console,
minimize/maximize windows, Ctrl-C out of the test program, and kill the
program. The system was fine and appeared stable after all the test runs. All
tests were run from the run level 5 X Window environment.
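For reference, error codes 11 and 12 on Linux x86 are EAGAIN and ENOMEM. The sketch below maps a pthread_create return code to the two outcomes described above; describe_create_failure is a hypothetical helper, and the wording of the messages is mine.

```python
import errno
import os


def describe_create_failure(rc):
    """Translate a thread-creation error code into a short explanation."""
    if rc == errno.EAGAIN:
        # Resource limit hit, e.g. /proc/sys/kernel/threads-max.
        return "thread/process limit reached; forks will fail too"
    if rc == errno.ENOMEM:
        return "out of memory for a new thread"
    # Anything else: fall back to the C library's message.
    return os.strerror(rc)


for code in (errno.EAGAIN, errno.ENOMEM):
    print(code, describe_create_failure(code))
```

Either outcome is the graceful failure the original report asks for; the bug is only the case where neither code is returned and the machine stops responding.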
The next step will be to try to reproduce the problem on an IBM NetVista PC.
Ken, could you describe what you mean by "hang"? Was the system locked solid
(i.e. no mouse/keyboard response at all)?
By Hang, I mean that the virtual window running the test would take no input
(ctrl-c, ctrl-z, nada) and I was unable to switch to another virtual window
(alt-F2, alt-F3, so on), and a soft reboot (ctrl-alt-del) would not do anything
and in order to reboot the machine I had to push the reset button. Left alone,
the system after 30minutes was still in the same state after I had done
multiple ctrl-c/ctrl-z combinations in the window. This was at run level 3, at
run level 5 all of X would hang - no mouse movement, no keystroke capabilities,
nada.
Switched to console 4 to look at why you have a problem installing Beta 1.
It looks like your cdrom is having problems - do not know if driver is not
included or device really has problems.
I switched to use floppy, then network install using skyline - Paul Edgar put
both Beta 1 and Beta 2 there already.
The machine almost finished installing Beta 1 now.
Have fun debugging.
See this page to find out how to create floppy install image ( it is for RHAS
2.1 ) but same for RHEL 3.0, or SuSE distros
The network device driver you need is 3c59x - it is written on the same note on
the machine with the ip addr, mask etc.
You can download test program etc yourself and debug
OK, Ken, thanks. I just wanted to make sure that you were indeed seeing
a "hard" hang. Unfortunately, we haven't been able to reproduce that condition
yet. I will try on our NetVista PC and report back the results.
I reproduced the problem with the following scenario:
IBM NetVista Model 6579-A4U
866MHz PIII (with 133MHz Front Side Bus)
-RHEL 3.0 Beta1
-Limited memory to 128MB by adding mem=128M kernel parameter
-Boot to run level 3.
-Increased /proc/sys/kernel/threads-max to 20000 (was 2048 by default).
-Ran ./LinuxServ 20000 96 (creates 20000 threads with 96K stack sizes)
After creating "many" threads, system hangs solid.
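The threads-max step in the scenario above can be scripted. This is a sketch under assumptions: read_limit and raise_limit are hypothetical helper names, and writing to the real /proc file requires root.

```python
from pathlib import Path

THREADS_MAX = Path("/proc/sys/kernel/threads-max")


def read_limit(path=THREADS_MAX):
    """Return the current system-wide thread limit as an integer."""
    return int(Path(path).read_text().strip())


def raise_limit(new_limit, path=THREADS_MAX):
    """Raise the limit to new_limit if it is currently lower.

    Returns the previous value so it can be restored after the test.
    """
    previous = read_limit(path)
    if new_limit > previous:
        Path(path).write_text(f"{new_limit}\n")
    return previous
```

Calling raise_limit(20000) before the test and writing the returned value back afterwards keeps the machine at its default between runs. The path parameter exists only so the helpers can be tried against an ordinary file first.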
Not yet sure what in the scenario above is the key to reproducing this, nor am
I sure how many threads were created prior to failing, but I will do more
experimentation to narrow down the relevant factors and go from there.
I was not able to reproduce using the same scenario as in the previous comment
with mem=256M instead, so 128M seems to be relevant.
threads-max is determined based on system memory, and is set to 2048 for 128M.
(I noticed it set for 4096 for 256M, and 8192 for 512M). It appears from the
bug description thus far that this problem has only been seen with thread counts
higher than 2048 on the reported system, so I assume threads-max is being
Ken, are you able to reproduce this problem without increasing threads-max?
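The three defaults quoted in this comment are consistent with a rule of 16 threads per MB of RAM. The sketch below only fits those three observations; it is not taken from the kernel source, and estimated_threads_max is a hypothetical helper.

```python
# Defaults observed in this bug: RAM in MB -> threads-max.
OBSERVED = {128: 2048, 256: 4096, 512: 8192}


def estimated_threads_max(mem_mb):
    """Guess the default threads-max from RAM, assuming 16 threads per MB."""
    return mem_mb * 16


for mem_mb, seen in OBSERVED.items():
    guess = estimated_threads_max(mem_mb)
    print(f"{mem_mb} MB: observed {seen}, estimated {guess}")
```

Every hang reported so far happened at thread counts well above the 2048 default for 128 MB, which is why the question about raising threads-max matters.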
Will add that our system described in comment#33 has 1GB swap.
No, without changing the system threads-max you are unable to create more than
2000 threads (the default) on the machine, and the machine will not hang but will
instead have failures to create threads/processes. We always have to change
this value because 2000 is much too low a limit for running Domino 6 with our
thread per connection model.
Hi all,
Even with RedHat Beta 2, kernel is at level 399. There is already a problem
with NFS on it ( see bug# 4333 ). I found a later kernel - level 423 with
build date of 9/4. The 423 version fixed 4333.
Mike, you may want to upgrade your system with RHN updates before debugging.
Salina
Created an attachment (id=1599)
threads testing program
Ahh, it didn't like my comments at same time as clicking create a new
attachment. Anyway, I'm at the 423 kernel level and still seeing the issue,
the attachment is a new version of my threads program (just the binary) which
prints out the thread number so that you can see how many it has created prior
to failure. Thanks!
kenbo
I also reproduced the problem on the following system:
x-Series 8500R, Type 8681-8RY
16 Gigabyte System Memory
I booted the RHEL 3.0-Beta2 UP kernel (2.4.21-1.1931.2.399.ent) with boot
parameter mem=128M, to runlevel 3. I increased /proc/sys/kernel/threads-max
from 2048 to 20000. I enabled the Magic SysRq keys by running
echo 1 > /proc/sys/kernel/sysrq. I ran ./LinuxServ 15000 96 (creating 15000 threads
with 96K stack sizes). After 11095 threads were created, the system hung. At
this point, I couldn't switch consoles or terminate the test app (Ctrl-C). I
was not able to ssh or telnet to the system in the hung state. I captured the
output of the Alt-SysRq-p sequence several times (dumps the CPU registers and a
call traceback). The CPU was still running, as several different calls were
happening each time I hit the key sequences. The following calls were the most
After capturing and logging several Alt-SysRq-p outputs, I did Alt-SysRq-m a
few times, which prints out current memory information. Free Pages is 6344K and
Highmem is 0K. I noticed that the only info that changes with each Alt-SysRq-m
I'll attach the log file containing the output from Alt-SysRq-p and Alt-SysRq-m
during the hang.
I will now try this on RHAS 2.1, to see if this is a "new" beta problem or a
problem that has been there all along.
Created an attachment (id=1609)
Alt-SysRq-p and Alt-SysRq-m output
I was not able to reproduce this under RHAS 2.1. I installed RHAS 2.1 on the
same PC on my previous experiment. I limited memory to 128MB using the mem=128M
kernel parameter. I increased /proc/sys/kernel/threads-max to 20000 (This value
was only 1024 by default under this installation). I ran ./LinuxServ 20000 96.
Error code 11 occurred at thread 1022, but no hang occurred.
The behavior thus seems to be unique to the RHEL 3.0 betas.
Khoa, should we re-assign to RedHat at this point?
Glen/Greg - we now think this
issue is specific to RHEL3 betas, so please
submit this to Red Hat. Thanks.
not that it's right for the kernel to hang, but the minimum amount of ram
supported by RHEL3 is 256Mb
------ Additional Comments From email@example.com 2003-16-09 15:15 -------
Well, if RHEL does not support running with less than 256MB of RAM, and we
cannot duplicate otherwise, then should I upgrade my system and we call this
not a bug? I guess it is up to RH and what they think about it and whether
or not they want to pursue the issue. Any thoughts? What do y'all think?
well I'm not jumping from joy about hanging even with 128Mb, however:
* increasing the thread value is a bit risky; the default is a value the kernel
considers safe for this amount of ram
* it's a pretty low amount of RAM
So while I'd rather not have a hang, it's not a super critical bug that warrants
redesigning the VM at this stage.
Changing state to closed/wontfix as the minimum supported memory configuration