Bug 104525

Summary: LTC4036 - Large number of threads locks up the system
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: IBM Bug Proxy <bugproxy>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
Docs Contact:
Target Milestone: ---
Target Release: ---
Whiteboard:
Doc Type: Bug Fix
Last Closed: 2003-10-02 00:42:22 UTC

Description IBM Bug Proxy 2003-09-16 18:35:28 UTC
The following has been reported by IBM LTC:
Large number of threads locks up the system
Please fill in each of the sections below.

Hardware Environment: PIII 500MHz, 128MB RAM, 1GB swap

Software Environment: RHAS 3.0 Beta1


Steps to Reproduce:
1. Create program to create threads
2. run program to create 10000 threads
3.

Actual Results: fails to create threads; the system hangs

Expected Results: create 10000 threads or error out gracefully

Additional Information:

NPTL/LT - a large # of threads hangs the OS.  Still investigating, but I've seen it
hang at around 3k and 7k NPTL with the GUI (GNOME) up, 9673 NPTL with the GUI
inactive, 5k LT with the GUI up, 7770 LT with the GUI inactive.  The entire machine
locks up and requires a hard reboot.  All my program does is loop doing
pthread_create (takes as
arguments the # of threads and the stack size of each thread); each thread 
calls sleep() for a long period of time.
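
For reference, a minimal sketch of that kind of test program follows.  It is not
the attached LinuxServ source; the argument order, defaults, and error handling
are assumptions for illustration only.

    /* Sketch: spawn N threads with a given stack size; each thread just sleeps.
     * Not the attached LinuxServ program - names and defaults are illustrative. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void *sleeper(void *arg)
    {
        sleep(3600);                 /* each thread sleeps for a long time */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 10000;
        long stack_kb = (argc > 2) ? atoi(argv[2]) : 96;
        pthread_attr_t attr;
        pthread_t tid;
        int i, rc;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, (size_t)stack_kb * 1024);

        for (i = 0; i < nthreads; i++) {
            rc = pthread_create(&tid, &attr, sleeper, NULL);
            if (rc != 0) {
                fprintf(stderr, "pthread_create failed at thread %d: %s\n",
                        i, strerror(rc));
                break;
            }
        }
        printf("created %d threads\n", i);
        pause();                     /* keep the process and its threads alive */
        return 0;
    }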

Program available upon request.

Ken - sounds like you run out of pinned memory.  Did you try to increase
the amount of RAM on your system and see if that makes a difference?

Nope, did
not try to add memory (budget).  I don't have a big iron machine to 
test this on either, just don't have the systems.  I did try changing the stack 
size - the first series of tests were at 96kb stack sizes, at 256kb, LT failed 
at 7565, which is pretty close to 7770 in # of threads.  So, 7770 @ 96kb =
728mb, 7565 @ 256kb = 1891mb.  Doesn't quite add up if it's a pinned memory
thing, unless it is something happening along the way internally that I'm not 
seeing (such as actually mapping in memory).  If you have a big iron RHAS 3.0 
Beta 1 system you can test this on, that would be great.  Either way, I still 
think it's bad that a user space program can hang the entire machine (hard boot 
required) - very bad.

FYI: NPTL @96 was 9673 threads, @256 was 8035 threads.

Thanks!
Ken - We will test this on our large 8-way SMP system with RHEL3 beta1 later
today. Still waiting for labserve to get us a removable scsi drive for the 8-way
Khoa,
Installed RHEL AS 3.0 beta on milicent.ltc.austin.ibm.com.
                              ipaddr 9.3.192.136.
You can telnet in remotely as root, our normal setup for password, etc.  The 
kernel source is on there, but I can't find glibc src rpm anywhere on the 3 CDs 
they shipped.   I will need to talk to Glen Johnson about that.
Mike Lepore says he needs the system back around 6 PM so I will shut it 
down at 6 and let Mike use it overnight.   You can have it back by morning.
Enjoy debugging.
Salina
Went to talk to Glen Johnson.  He told me Beta 2 is coming out today.
Scott Russell should have it shadowed in the next day or so.
Y'all may want to use Beta 2 to look at this versus Beta 1?
Salina

Khoa,
I have Beta 2 CDs downloaded and cut.   Let me know if you want me to put that 
on milicent.
Also, Beta 1 has source CDs.  They are just not on our internal ftp site.  Glen 
is going to ask Scott about that.
Salina

Salina - thanks for your help.  Yes, please install Beta2 on the machine.
Thanks.

Hi Khoa,
Ok, will install Beta 2 on it tomorrow - takes about 2 hours to install it.  
Mike is going to use the machine now for lkcd testing.  
Khoa,
Beta 2 installed on milicent.
Have fun debugging.

Mike - Please handle this bug for me.  I am overwhelmed by
other stuff.
Thanks.

Hi Ken,

I realize your test case is simple, but could you attach the source code and 
build instructions just to keep things as consistent as possible. Thanks.

Khoa / Mike,
I also downloaded the source iso files for beta 2 now.
They are on /iso on milicent.  
I installed the glibc source rpm and already did 
   rpmbuild -bp
so you can see the source code in the /usr/src/redhat/BUILD/glibc-2.3.2-200308141835
tree.
I will be out tomorrow morning, but I left the 3 install CDs in my big black CD 
case in the lab in case you need it.  They are marked Taroon Beta 2 ( red ink ).

Created an attachment (id=1485)
Tar file of my simple thread program

I'm on vac, but I think this is a fairly recent version of my threads program.

Thanks!

kenbo
I think this is not specifically related to RHEL3, but it is an NPTL/threading
issue, which is why my team is investigating.

Mike,
I know you are on vacation today, so I did some leg work for you.
I downloaded Ken's tar file and untarred it on /root/threads on milicent for you 
already.
I briefly looked at what is there.
compit.l is the Linux way to re-build the executable.
I changed it to add:
   set -x ( traces commands )
   --save-temps to the gcc options - so you can see the macro expansion
     in the filename.i files
We do not have NGPT installed, so you need not worry about that.

To use LinuxThreads, before you execute LinuxServ
do export LD_ASSUME_KERNEL=2.2.5 and you will see the /lib/libpthread.so.0 
being loaded 

if you do 
  unset LD_ASSUME_KERNEL
you will see /lib/tls/libpthread.so.0 being loaded and thus using NPTL

 

Running on milicent ( Beta 2 installed ).
I am getting up to 11991@256K or 2029@1024K
I get ENOMEM ( rc = 12 ) from pthread_create but system still runs fine 



typed too fast, it is 3029@1024 we got running on milicent

Ken - please try your testcase on RHEL3 beta2....  Salina cannot reproduce
this bug on beta2 - it returns ENOMEM error which is expected.  Thanks.

I have this and another bug to test with latest stuff, so, when I'm back from
Vac Tuesday I will download the latest stuff and try again.  Thanks y'all!

(FYI:  We are on vac in Cancun on our 6th yr Anniversary :)

I'll let y'all know as soon as I have results. (Back from vac Tuesday)

Thanks! 

kenbo
Hi Kenbo,

Please let us know if you still have problems with Beta 2.   Thanks
Salina

We should mention milicent is a Netfinity 8500R ( 8681-8RY ), 8 CPU, 16G memory.
I installed the machine with 2G swap space.

I'm actually now on Beta 2, but having
problems getting the latest "patches" 
due to the issue with SSL - I manually downloaded up2date and up2date-gnome, 
but still cannot connect to RHN so I'm having to download packages 
individually.  Once I'm at the latest of Beta 2 I will retry the threading test.

Well, I updated to the latest packages from RHN and I can still duplicate the 
issue on my system :(  Got to 6606 threads @ 256kb using NPTL and everything 
locked up on my machine.  I know I'm pushing my little single cpu 500mhz, but I 
just don't like that my little user space program can lock up the entire 
system - I'm in runlevel 3 and I cannot ctrl-c, cannot change virtual 
terminals, nada, gotta do a hard reboot.  I'm on the 414 level of the latest 
kernel with full patches.  Ideas?



kenbo
I was at run level 5 - X was up and everything - but I have a "big iron".
Khoa, do you want us to try to dup the problem now with one of the netvistas or 
do you just want to report this?   NPTL is RedHat's own product but this 
sounds like a kernel problem if the entire machine locks up.
Let me know what you want to do.  I am working on the p-series machine, or 
maybe you want to investigate this yourself since you have more VMM expertise.
Mike Lepore is going to try to re-create and look into this with a Netvista / 
Netfinity box we have in the lab.

Just as an FYI - this happens with
LD_ASSUME_KERNEL=2.4.18 and hence using 
LinuxThreads as well (on this run got 5847 threads @ 256kb).
We tried to reproduce the problem on the 8500R 8-way, running Beta2, to no 
avail. We tried both the smp and up kernels. On both kernels we tried limiting 
the memory by passing mem=128M to the kernel and also tried with full memory 
(the smp kernel showed the full 16G system memory, and the up kernel showed 
4G). In all cases we tried to reach 20000 threads at 256K and 1024K stack 
sizes. When threads-max is reached, error code 11 occurs. When memory is 
exhausted, error code 12 occurs. In cases where threads-max is reached first, I 
tried bumping the value up and re-running the test. This would result in error 
code 12. When error 11 occurs, anything that requires a fork will fail 
(starting a new session, issuing a command like ls, etc.). In the case where 
error 12 occurs, the system simply slows down a little. In no case did the 
system actually hang - we were able to switch to a new console, 
minimize/maximize windows, Ctrl-C out of the test program, and kill the 
program. The system was fine and appeared stable after all the test runs. All 
tests were run from the run level 5 x-window environment.
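
For reference, error code 11 is EAGAIN and error code 12 is ENOMEM, and
pthread_create reports them through its return value rather than via errno.  A
minimal sketch of a loop that tells the two apart (illustrative only, not taken
from the attached test program):

    /* Sketch only: interpret pthread_create's return value.  pthread_create
     * returns the error number directly; it does not set errno. */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void *worker(void *arg) { sleep(600); return NULL; }

    int main(void)
    {
        pthread_t tid;
        int i, rc;

        for (i = 0; ; i++) {
            rc = pthread_create(&tid, NULL, worker, NULL);
            if (rc == 0)
                continue;
            if (rc == EAGAIN)        /* 11: thread/resource limit (e.g. threads-max) hit */
                fprintf(stderr, "EAGAIN after %d threads\n", i);
            else if (rc == ENOMEM)   /* 12: no memory left for another thread */
                fprintf(stderr, "ENOMEM after %d threads\n", i);
            else
                fprintf(stderr, "error %d (%s) after %d threads\n",
                        rc, strerror(rc), i);
            break;
        }
        return 0;
    }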

The next step will be to try to reproduce the problem on an IBM NetVista PC.

Ken, could you describe what you mean by "hang"? Was the system locked solid 
(i.e. no mouse/keyboard response at all)?



By Hang, I mean that the virtual window running the test would take no input 
(ctrl-c, ctrl-z, nada) and I was unable to switch to another virtual window 
(alt-F2, alt-F3, so on), and a soft reboot (ctrl-alt-del) would not do anything 
and in order to reboot the machine I had to push the reset button.  Left alone, 
the system after 30 minutes was still in the same state after I had done 
multiple ctrl-c/ctrl-z combinations in the window.  This was at run level 3; at 
run level 5 all of X would hang - no mouse movement, no keystroke capabilities, 
nada.

Switched to console 4 to look at why you had problems installing Beta 1.
It looks like your cdrom is having problems - I do not know if the driver is not 
included or the device really has problems.
I switched to use floppy, then network install using skyline - Paul Edgar put 
both Beta 1 and Beta 2 there already.
The machine almost finished installing Beta 1 now.
Have fun debugging.

See this page to find out how to create a floppy install image ( it is for RHAS 
2.1 ) but the same applies for RHEL 3.0, or SuSE distros:
http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/install-guide/

The network device driver you need is 3c59x - it is written on the same note on 
the machine with the ip addr, mask etc.

You can download test program etc yourself and debug

Salina
 
OK, Ken, thanks. I just wanted to make sure that you were indeed seeing 
a "hard" hang. Unfortunately, we haven't been able to reproduce that condition 
yet. I will try on our NetVista PC and report back the results.

I reproduced the problem with the following scenario:

Hardware:

IBM NetVista Model 6579-A4U
866MHz PIII (with 133MHz Front Side Bus)
512MB memory

Software:

-RHEL 3.0 Beta1

-Limited memory to 128MB by adding mem=128M kernel parameter 
to /boot/grub/menu.lst

-Boot to run level 3.

-Increased /proc/sys/kernel/threads-max to 20000 (was 2048 by default).

-Ran ./LinuxServ 20000 96 (creates 20000 threads with 96K stack sizes)

After creating "many" threads, the system hangs solid.

Not yet sure what in the scenario above is the key to reproducing this, nor am 
I sure how many threads were created prior to failing, but I will do more 
experimentation to narrow down the relevant factors and go from there.

Was not able to reproduce using the same scenario as in the previous comment after
changing to mem=256M, so 128M seems to be relevant.

threads-max is determined based on system memory, and is set to 2048 for 128M. 
(I noticed it set for 4096 for 256M, and 8192 for 512M). It appears from the 
bug description thus far that this problem has only been seen with thread counts 
higher than 2048 on the reported system, so I assume threads-max is being 
increased.
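
For context, the 2.4 kernel appears to derive the default threads-max from
available memory (kernel/fork.c sets max_threads = mempages / (THREAD_SIZE /
PAGE_SIZE) / 8), which would match the 2048/4096/8192 values noted above.  The
sketch below just reproduces that arithmetic, assuming i386 with 4K pages and 8K
kernel stacks; it is an illustration, not kernel code.

    /* Sketch of the 2.4 default threads-max calculation (kernel/fork.c):
     *   max_threads = mempages / (THREAD_SIZE / PAGE_SIZE) / 8
     * Assumes i386: PAGE_SIZE = 4K, THREAD_SIZE = 8K.  Illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const long PAGE_SIZE = 4096, THREAD_SIZE = 8192;
        const long mem_mb[] = { 128, 256, 512 };
        int i;

        for (i = 0; i < 3; i++) {
            long mempages = mem_mb[i] * 1024L * 1024L / PAGE_SIZE;
            long max_threads = mempages / (THREAD_SIZE / PAGE_SIZE) / 8;
            printf("%4ld MB -> default threads-max %ld\n", mem_mb[i], max_threads);
        }
        return 0;
    }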

Ken, are you able to reproduce this problem without increasing threads-max?
Will add that our system described in comment#33 has 1GB swap.

No, without changing the system threads-max you are unable to create more than
2000 threads (the default) on the machine and the machine will not hang but will
instead have failures to create threads/processes.  We always have to change
this value because 2000 is much too low of a limit for running Domino 6 with our
thread-per-connection model.

Hi all,
Even with RedHat Beta 2, the kernel is at level 399.   There is already a problem 
with NFS on it ( see bug# 4333 ).  I found a later kernel - level 423 with 
build date of 9/4.  The 423 version fixed 4333.
Mike, you may want to upgrade your system with RHN updates before debugging.
Salina

Created an attachment (id=1599)
threads testing program

Ahh, it didn't like my comments at the same time as clicking create a new 
attachment.  Anyway, I'm at the 423 kernel level and still seeing the issue, 
the attachment is a new version of my threads program (just the binary) which 
prints out the thread number so that you can see how many it has created prior 
to failure.  Thanks!  

kenbo

I also reproduced the problem on the following system:

   x-Series 8500R, Type 8681-8RY
   16 Gigabyte System Memory
   8-way SMP

I booted the RHEL 3.0-Beta2 UP kernel (2.4.21-1.1931.2.399.ent) with boot 
parameter mem=128M, to runlevel 3. I increased /proc/sys/kernel/threads-max 
from 2048 to 20000. I enabled the Magic SysRq keys by echo 1 
> /proc/sys/kernel/sysrq. I ran ./LinuxServ 15000 96 (creating 15000 threads 
with 96K stack sizes). After 11095 threads were created, the system hung. At 
this point, I couldn't switch consoles or terminate the test app (Ctrl-C). I 
was not able to ssh or telnet to the system in the hung state. I captured the 
output of the Alt-SysRq-p sequence several times (dumps the CPU registers and a 
call traceback). The CPU was still running, as several different calls were 
happening each time I hit the key sequences. The following calls were the most 
commonly seen:

try_to_free_buffers
rebalance_laundry_zone
try_to_release_page
unlock_page
balance_dirty_state
nr_free_buffer_pages
page_referenced
launder_page
do_softirq
sync_page_buffers

After capturing and logging several Alt-SysRq-p outputs, I did Alt-SysRq-m a 
few times, which prints out current memory information. Free Pages is 6344K and 
Highmem is 0K. I noticed that the only info that changes with each Alt-SysRq-m 
was "inactive_laundry_pages".

I'll attach the log file containing the output from Alt-SysRq-p and Alt-SysRq-m 
during the hang.

I will now try this on RHAS 2.1, to see if this is a "new" beta problem, or a 
problem that has been there all along.

Created an attachment (id=1609)
Alt-SysRq-p and Alt-SysRq-m output

I was not able to reproduce this under RHAS 2.1. I installed RHAS 2.1 on the 
same PC as in my previous experiment. I limited memory to 128MB using the mem=128M 
kernel parameter. I increased /proc/sys/kernel/threads-max to 20000 (this value 
was only 1024 by default under this installation). I ran ./LinuxServ 20000 96. 
Error code 11 occurred at thread 1022, but no hang occurred.

The behavior thus seems to be unique to the RHEL 3.0 betas.

Khoa, should we re-assign to RedHat at this point?

Glen/Greg - we now think this issue is specific to RHEL3 betas, so please
submit this to Red Hat.  Thanks.

Comment 1 Arjan van de Ven 2003-09-16 19:01:37 UTC
not that it's right for the kernel to hang, but the minimum amount of RAM
supported by RHEL3 is 256MB

Comment 2 IBM Bug Proxy 2003-09-16 19:17:44 UTC
------ Additional Comments From kenbo.com  2003-16-09 15:15 -------
Well, if RHEL does not support running with less than 256MB of RAM, and we 
cannot duplicate otherwise, then should I upgrade my system and we call this 
not a  bug?  I guess it is up to RH and what they think about it and whether 
or not they want to pursue the issue.  Any thoughts?  What do y'all think?

Thanks!

kenbo 

Comment 3 Arjan van de Ven 2003-09-16 19:22:18 UTC
well I'm not jumping for joy about hanging even with 128Mb, however:
* increasing the thread value is a bit risky; the default is a value the kernel
considers safe for this amount of ram
* it's a pretty low amount of ram

So while I'd rather not have a hang, it's not a super-critical bug that warrants
redesigning the VM at this stage. 

Comment 4 Tim Burke 2003-10-02 00:42:22 UTC
Changing state to closed/wontfix as the minimum supported memory configuration
is 256M.