Bug 120889

Summary:	LTC7569-System hang after the pid reach to 32768 even set pid_max large enough
Product:	Red Hat Enterprise Linux 3	Reporter:	IBM Bug Proxy <bugproxy>
Component:	kernel	Assignee:	Ernie Petrides <petrides>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.0	CC:	dhowells, petrides
Hardware:	powerpc
OS:	Linux
Doc Type:	Bug Fix
Last Closed:	2004-12-20 20:55:02 UTC
Bug Blocks:	130338

Description IBM Bug Proxy 2004-04-14 21:29:53 UTC

The following has been reported by IBM LTC:
System hangs after the pid reaches 32768 even with pid_max set large enough
Hardware Environment: pSeries 650, 2x1199MHz cpu, 4G Mem, 4G swap

Software Environment:
[root@plinuxt15 root]# uname -a
Linux plinuxt15.ppc.cn.ibm.com 2.4.21-13.EL #1 SMP Wed Apr 7 22:43:27 EDT 2004 ppc64 ppc64 ppc64 GNU/Linux

Steps to Reproduce:
1. Modify the pid_max value to 50000;
[root@plinuxt15 root]# ps
  PID TTY          TIME CMD
18503 pts/0    00:00:00 bash
18849 pts/0    00:00:00 ps

[root@plinuxt15 root]# cat /proc/sys/kernel/pid_max 
32768

[root@plinuxt15 root]# echo 50000 > /proc/sys/kernel/pid_max

[root@plinuxt15 root]# cat /proc/sys/kernel/pid_max 
50000

2. Spawn some processes until the pid reaches 32767;
[root@plinuxt15 root]# for ((i=18504,i<32768;i++)); do ls; done
......
  PID TTY          TIME CMD
18503 pts/0    00:00:13 bash
32766 pts/0    00:00:00 ps
  PID TTY          TIME CMD
18503 pts/0    00:00:13 bash
32767 pts/0    00:00:00 ps

3. The system hung after the pid reached 32768:
[root@plinuxt15 root]# ps

Ping or ssh to the machine will fail.

This is the information in the hmc console:
plinuxt15.ppc.cn.ibm.com login: de_put: entry <NULL> already free!
de_put: entry <NULL> already free!
de_put: entry <NULL> already free!
de_put: entry <NULL> already free!
de_put: entry <NULL> already free!
de_put: entry <NULL> already free!
de_put: entry <NULL> already free!

Actual Results: System hung

Expected Results: The system should not hang.

Additional Information:

Comment 1 Ernie Petrides 2004-04-14 22:03:37 UTC

Please retry your experiment with the "for" command in step 2
corrected so that the first comma is a semi-colon, like this:

   for ((i=18504;i<32768;i++)); do ls; done

Without this correction, you have essentially tried to do "ls"
commands forever (with no limit).  If there is still a hang,
please let us know.  If not, please close this bug report.

Thanks.  -ernie

Comment 2 IBM Bug Proxy 2004-04-15 02:20:50 UTC

----- Additional Comments From liuyan.com  2004-04-14 22:21 -------
Sorry, it's just a typo. The command should be:

for ((i=18504;i<32768;i++)); do ps; done

I verified it again this morning, and the machine did indeed hang.

Sorry for the confusion. Thanks.
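
For anyone re-running the reproduction, a bounded variant of the loop may be
safer than counting iterations by hand. The sketch below is not from the
original report: the function name spawn_until_pid and the iteration cap are
illustrative assumptions. It keeps forking short-lived children until one of
them is given a pid at or above the target, or until the cap is hit.

```bash
# Sketch only: spawn_until_pid and max_iter are hypothetical names, not part
# of the original test procedure.
# Prints the last child pid seen; returns 0 if the target pid was reached,
# 1 if the iteration cap was hit first.
spawn_until_pid() {
    local target=$1 max_iter=$2 i=0 pid=0
    while [ "$i" -lt "$max_iter" ]; do
        # The command substitution forks a child; sh prints its own pid.
        pid=$(sh -c 'echo $$')
        if [ "$pid" -ge "$target" ]; then
            echo "$pid"
            return 0
        fi
        i=$((i + 1))
    done
    echo "$pid"
    return 1
}
```

For example, "spawn_until_pid 32768 100000" would stop at the first child
whose pid is 32768 or higher; on an affected kernel the hang would be expected
around that point.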

Comment 3 IBM Bug Proxy 2004-04-15 03:21:12 UTC

----- Additional Comments From olof.com(prefers email via olof.com)  2004-04-14 23:24 -------
Did this happen when the total number of processes reached 32768, or did you
have fewer processes than that, with only the highest pid being 32768?

Comment 4 IBM Bug Proxy 2004-04-15 05:26:06 UTC

----- Additional Comments From liuyan.com  2004-04-15 01:30 -------
This happened when the highest pid number reached 32768.

A test on a p640 this morning showed that the machine is pingable but cannot
be reached via ssh.

Comment 5 Mark DeWandel 2004-04-20 17:32:40 UTC

Running the simple bash loop above on a UP x86 box yields more
spectacular results:

 printing eip:
c01248d0
*pde = 00000000
Oops: 0002
parport_pc lp parport autofs audit 3c59x sg scsi_mod microcode keybdev
mousedev hid input usb-uhci usbcore ext3 jbd
CPU:    0
EIP:    0060:[<c01248d0>]   Not tainted
EFLAGS: 00010206

EIP is at do_exit [kernel] 0x280 (2.4.21-14.EL/i686)
eax: 00c0342d   ebx: c14fd980   ecx: c14fd000   edx: 00000000
esi: c1543140   edi: c3a1a524   ebp: c3476000   esp: c3477f98
ds: 0068   es: 0068   ss: 0068
Process ps (pid: 65536, stackpage=c3477000)
Stack: c1548da8 c14fd980 00000000 c3476000 b75c4d54 bfffaa48 c0124a44 00000000
       c3476000 b75c4d54 c038c06f 00000000 00000000 b75c3260 b75c4d54 b75c4d54
       bfffaa48 000000fc 0000002b 0000002b 000000fc b75ebc32 00000023 00000246
Call Trace:   [<c0124a44>] do_group_exit [kernel] 0x54 (0xc3477fb0)

Code: ff 48 10 8b 45 10 8b 40 24 83 48 14 08 8b 85 80 00 00 00 85

Kernel panic: Fatal exception

Unless this is coincidentally a separate architecture-specific bug,
the cause is probably in common code.  I have been able to reproduce
two instances of the do_exit() panic but not at exactly the same EIP.

Comment 6 IBM Bug Proxy 2004-04-23 13:11:02 UTC

----- Additional Comments From kaena.com  2004-04-23 09:12 -------
Mark as 'high' to track into U3.

Comment 7 IBM Bug Proxy 2004-07-20 02:58:00 UTC

----- Additional Comments From khoa.com  2004-07-19 22:58 -------
Sachin - can your team help on this bug ?  Thanks.

Comment 9 IBM Bug Proxy 2004-07-21 01:03:28 UTC

----- Additional Comments From liuyan.com  2004-07-19 04:46 -------
Also tested on the RHEL3 U3 0709 ISO, kernel 2.4.21-17.EL; the defect has not
been fixed there. Any update? Thanks.

Comment 10 IBM Bug Proxy 2004-07-21 11:33:21 UTC

----- Additional Comments From prashanth_t.com  2004-07-21 07:30 -------
I am able to recreate the problem, though intermittently, on a ppc64 system. 
Looking at it further.

Comment 11 IBM Bug Proxy 2004-07-21 17:08:24 UTC

----- Additional Comments From markwiz.com  2004-07-21 13:07 -------
What kernel/ISO/or RPM are you running with?

Comment 12 IBM Bug Proxy 2004-07-22 02:33:39 UTC

----- Additional Comments From liuyan.com  2004-07-21 22:32 -------
This was first found on RHEL3 U2, kernel 2.4.21-13.EL. It was also reproduced
on the RHEL3 U3 0709 ISO, kernel 2.4.21-17.EL. Thanks.

Comment 13 IBM Bug Proxy 2004-07-22 11:08:36 UTC

----- Additional Comments From prashanth_t.com  2004-07-22 07:05 -------
I came across a patch from Zhu on lkml which has a fix in alloc_pidmap.  This
patch might be related to this bug, but I am not sure why it has not been
included even in recent 2.6.7 kernels.  Below is the link for the patch.
Please apply this test patch and let us know the results.

http://seclists.org/lists/linux-kernel/2004/Jan/0931.html

Comment 15 IBM Bug Proxy 2004-07-27 07:19:50 UTC

----- Additional Comments From davidyao.com  2004-07-27 03:19 -------
I rebuilt the kernel on RHEL3 U3 0720 with the recommended patch to test it.
The system no longer hangs, but errors still appear after the pid reaches 32k.

Below is the test process:

1. Modify the /usr/src/linux-2.4/kernel/pid.c
[root@plinuxt17 kernel]# diff -u pid.c.bak  pid.c
--- pid.c.bak   2004-07-23 18:59:58.000000000 +0800
+++ pid.c       2004-07-23 19:01:54.000000000 +0800
@@ -120,6 +120,8 @@
        }

        if (!offset || !atomic_read(&map->nr_free)) {
+if (!offset)
+       map--;
 next_map:
                map = next_free_map(map, &max_steps);
                if (!map)

2. Rebuild kernel

3. Modify /proc/sys/kernel/pid_max from 32k to 40000
[root@plinuxt17 kernel]# echo 40000 > /proc/sys/kernel/pid_max
[root@plinuxt17 kernel]# cat /proc/sys/kernel/pid_max
40000

4. Start test
[root@plinuxt17 kernel]# for ((i=1567;i<32768;i++)); do ps; done
......
32766 pts/1    00:00:00 ps
  PID TTY          TIME CMD
 1493 pts/1    00:00:31 bash
32767 pts/1    00:00:00 ps
stat: Value too large for defined data type
  PID TTY          TIME CMD
stat: Value too large for defined data type
  PID TTY          TIME CMD
stat: Value too large for defined data type
  PID TTY          TIME CMD
stat: Value too large for defined data type
  PID TTY          TIME CMD
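
The "if (!offset)" test in the patch from step 1 is easier to follow with the
index arithmetic written out. The sketch below is only a model of that
arithmetic, not kernel code, and it assumes that each pid bitmap page holds
32768 bits (a 4 KiB page); BITS_PER_PAGE and pidmap_slot are names chosen here
for illustration. Under that assumption, pid 32768 is the first pid that lands
on bit 0 of the second map, which is where the reported hang starts.

```bash
# Assumption: one bitmap page holds 4096 * 8 = 32768 pid bits.
BITS_PER_PAGE=32768

# Print which bitmap page ("map") and which bit within it ("offset") a pid
# would fall on under the assumption above. Hypothetical helper, not kernel code.
pidmap_slot() {
    local pid=$1
    echo "map=$((pid / BITS_PER_PAGE)) offset=$((pid % BITS_PER_PAGE))"
}

for pid in 32767 32768 65536; do
    echo "pid $pid -> $(pidmap_slot "$pid")"
done
# pid 32767 -> map=0 offset=32767
# pid 32768 -> map=1 offset=0
# pid 65536 -> map=2 offset=0
```

If the assumption holds, the next map boundary after 32768 is pid 65536, the
same pid that appears in the x86 oops in comment 5, though whether the two are
connected is a guess.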

Comment 16 IBM Bug Proxy 2004-07-27 10:54:59 UTC

----- Additional Comments From zhouwu.com  2004-07-27 06:54 -------
It seems that the above patch fixes the hang, but it exposes another problem
in the "ps" utility. Judging from the output, the stat function returns
-EOVERFLOW, which stands for "Value too large for defined data type".

While trying to determine the root cause, I found another interesting
phenomenon: running the above test with an older version of ps (2.0.13) under
the patched kernel does not produce the stat error. But once the max pid
reaches 32768, ps lists no processes at all:

.....
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
  PID TTY          TIME CMD
......

You can log on to 9.181.24.49 (usr/pwd = root/plinux) to look into this. The
current version of ps is 2.0.13, which I built from the procps source package.
The original version of ps is 2.0.17 and has been renamed to /bin/ps.orig. You
can also find the 2.0.13 source under /usr/src/redhat/BUILD/procps-2.0.13/.

Thanks.

Comment 17 IBM Bug Proxy 2004-07-28 10:14:45 UTC

----- Additional Comments From prashanth_t.com  2004-07-28 06:14 -------
It is not just that 'ps' fails to list its own process: none of the processes
with pids >32768 are listed by the 'ps' command.  /proc, however, has entries
for those pids with correct information.
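
One way to see the mismatch described above is to list the numeric entries
under /proc above a threshold and compare them with what ps reports. The
helper below is a sketch; the name proc_pids_above and its optional directory
argument are assumptions made here so that it can also be pointed at a scratch
directory.

```bash
# List numeric directory entries (pids) under a proc-style directory whose
# value exceeds a threshold. Hypothetical helper; procdir defaults to /proc.
proc_pids_above() {
    local threshold=$1 procdir=${2:-/proc} entry name
    for entry in "$procdir"/[0-9]*; do
        # Skip non-directories and the unexpanded glob when nothing matches.
        [ -d "$entry" ] || continue
        name=${entry##*/}
        # Skip names that contain anything other than digits.
        case $name in *[!0-9]*) continue ;; esac
        if [ "$name" -gt "$threshold" ]; then
            echo "$name"
        fi
    done
}
```

Running "proc_pids_above 32767" next to "ps -e -o pid=" would then show which
of the high pids exist in /proc but are missing from the ps listing.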

Since the 'stat' error gives some information for debugging the problem with
procps-2.0.17, I would like to have its sources on your system.  I couldn't find
the sources for this version.  Can you please install the source rpm for procps
from your CDs?  I am downloading some of the ISO images for rhel3-u2 to get
these sources.

Comment 18 IBM Bug Proxy 2004-07-29 01:50:06 UTC

----- Additional Comments From zhouwu.com  2004-07-28 21:50 -------
Sorry, I don't have the source code of procps-2.0.17 and don't know where to
get it either. (If I could, I would have used this version instead of 2.0.13.)

The U3 Beta ISOs do not seem to contain any source rpm packages. Maybe we
could ask Red Hat where to get the latest source rpm of procps.

Comment 19 IBM Bug Proxy 2004-07-29 01:55:17 UTC

----- Additional Comments From zhouwu.com  2004-07-28 21:52 -------
procps-2.0.13 is from RHEL3 Update2. I got it from RHN. Just FYI.

Comment 20 IBM Bug Proxy 2004-07-30 10:40:25 UTC

----- Additional Comments From prashanth_t.com  2004-07-30 06:38 -------
I tried procps-2.0.17 (from Fedora) and could see the error from 'stat'.
Looking at the source code, 'ps', when executed with no options, returns 1 from
table_accept( ) when certain conditions on euid/tty are satisfied.  I observe
that when the process id is >32767, table_accept( ) returns 0.  This is because
the on_our_tty( ) condition fails, since cached_tty is '0' in that case.

cached_tty is set after get_proc_stats( ), which in turn calls stat( ), from
where the error code is seen.  This needs more debugging, looking at stat( ).

Since this bug looks to be different from the pid hang, I request you to open a
new bug for this.  I would continue working on this issue anyway.

Comment 21 IBM Bug Proxy 2004-08-03 04:06:12 UTC

----- Additional Comments From ssant.com  2004-08-03 00:06 -------
Changing the resolution to FIX ALREADY AVAILABLE.

Will track the issue of ps not displaying output for pid's > 32767 in bug #10305

Thanks

Comment 22 IBM Bug Proxy 2004-08-18 03:31:07 UTC

----- Additional Comments From liuyan.com  2004-08-17 23:32 -------
We just installed the newest RHEL3-U3-re0813.1 ISOs, and the patch is still
NOT included in them.

[root@plinuxt15 root]# uname -a
Linux plinuxt15.cn.ibm.com 2.4.21-19.EL #1 SMP Thu Aug 12 23:21:44 EDT 2004 ppc64 ppc64 ppc64 GNU/Linux

By the way, who is responsible for submitting the patch? And when does Red Hat
plan to apply the patch to the kernel?

Comment 23 IBM Bug Proxy 2004-08-19 04:11:16 UTC

----- Additional Comments From khoa.com  2004-08-19 00:07 -------
This bug report has been mirrored to Red Hat, so Red Hat should be able to
access this patch.  I've put this patch on my list and will send it to
Mark Wisner at Red Hat tomorrow for extra awareness.

Comment 26 IBM Bug Proxy 2004-08-24 04:12:25 UTC

----- Additional Comments From ssant.com  2004-08-24 00:10 -------
Any update on this from RH?

Comment 27 IBM Bug Proxy 2004-09-06 07:35:07 UTC

----- Additional Comments From liuyan.com  2004-09-06 03:35 -------
It seems Red Hat still did NOT include the patch in the RHEL3 U3 GM ISOs, which
were released on 09/04/2004. Thanks.

Comment 30 Ernie Petrides 2004-09-26 10:23:56 UTC

I have just posted a patch to our internal review mailing list for
addressing this problem.  Unless it meets with significant resistance,
it (or some variation of it) will be incorporated into U4 within the
next few days.

Comment 31 Ernie Petrides 2004-09-30 12:36:06 UTC

A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.14.EL).

Comment 32 IBM Bug Proxy 2004-11-03 07:49:50 UTC

----- Additional Comments From liuyan.com  2004-11-03 02:52 EDT -------
I have tried this on the RHEL3 U4 1020 ISOs. The test passes on a p630 and on
one Power5 SF4HV, but it hung on another Power5 SF4HV. I will continue to
investigate. Thanks.

[root@plinuxt20 kernel]# cat /etc/issue
Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel \r on an \m

[root@plinuxt20 kernel]# uname -r
2.4.21-21.EL

[root@plinuxt20 kernel]# cat /proc/sys/kernel/pid_max
32768
[root@plinuxt20 kernel]# echo 40000 > /proc/sys/kernel/pid_max
[root@plinuxt20 kernel]# cat /proc/sys/kernel/pid_max
40000

[root@plinuxt20 kernel]# for ((i=304;i<40005;i++)); do ps; done
...
14705 pts/0    00:00:43 bash
39998 pts/0    00:00:00 ps
  PID TTY          TIME CMD
14705 pts/0    00:00:43 bash
39999 pts/0    00:00:00 ps
  PID TTY          TIME CMD
14705 pts/0    00:00:43 bash
  300 pts/0    00:00:00 ps
  PID TTY          TIME CMD
...
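
To confirm from a captured run that no pid ever exceeded the configured limit,
the highest value in the PID column can be pulled out of the ps output. This
is a sketch, not part of the original test; max_pid_seen is a hypothetical
helper name, and it assumes the default ps layout shown above with the pid in
the first column.

```bash
# Read ps output on stdin and print the largest numeric value found in the
# first column. Header lines ("PID TTY ...") and other non-numeric rows are
# ignored. Prints 0 if no pid was seen. Hypothetical helper.
max_pid_seen() {
    awk '$1 ~ /^[0-9]+$/ && $1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}
```

Piping the test loop through it, as in "for ((i=0;i<50000;i++)); do ps; done |
max_pid_seen", and comparing the result with /proc/sys/kernel/pid_max should
give a value below the limit; for the excerpt shown above the result would be
39999 with pid_max set to 40000.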

Comment 33 IBM Bug Proxy 2004-11-05 09:09:46 UTC

----- Additional Comments From liuyan.com  2004-11-05 04:08 EDT -------
The hang on the SF4HV is more likely an HMC vterm problem. I tested this bug
in several ways, and they all passed on the RHEL3 U4 1020 ISOs. Thanks.

Comment 34 IBM Bug Proxy 2004-11-05 09:10:07 UTC

----- Additional Comments From liuyan.com  2004-11-05 04:09 EDT -------
Close it. Thanks.

Comment 35 John Flanagan 2004-12-20 20:55:02 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html