Bug 120889
Summary: | LTC7569-System hang after the pid reach to 32768 even set pid_max large enough | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | IBM Bug Proxy <bugproxy> |
Component: | kernel | Assignee: | Ernie Petrides <petrides> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | dhowells, petrides |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | powerpc | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-12-20 20:55:02 UTC | Type: | --- |
Bug Blocks: | 130338 |
Description
IBM Bug Proxy
2004-04-14 21:29:53 UTC
Please retry your experiment with the "for" command in step 2 corrected so that the first comma is a semi-colon, like this:

    for ((i=18504;i<32768;i++)); do ls; done

Without this correction, you have essentially tried to do "ls" commands forever (with no limit).

If there is still a hang, please let us know. If not, please close this bug report. Thanks. -ernie

----- Additional Comments From liuyan.com 2004-04-14 22:21 -------
Sorry, it's just a typo. The command should be:

    for ((i=18504;i<32768;i++)); do ps; done

And I verified it again this morning, the machine hang indeed. Sorry for the confusion. Thanks.

----- Additional Comments From olof.com(prefers email via olof.com) 2004-04-14 23:24 -------
Did this happen when you reached a total number of processes being 32768, or did you have fewer processes than that, just that the highest number was 32768?

----- Additional Comments From liuyan.com 2004-04-15 01:30 -------
This happened when the highest pid number reached to 32768. The test on a p640 this morning showed the machine is pingable, while cannot be connected by ssh.

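The corrected loop quoted above can be wrapped in a small helper for repeated test runs. This is only a sketch: `repeat_range` is a hypothetical name that does not come from this report, and it assumes bash, since the `for ((...))` arithmetic form is a bash feature rather than POSIX sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original report): run a command
# once for every integer i with start <= i < end, using the corrected
# "for ((init; condition; step))" form with two semi-colons.
repeat_range() {
    local start=$1 end=$2
    shift 2
    local i
    for ((i = start; i < end; i++)); do
        "$@"
    done
}

# Example, equivalent to the loop in the comment above:
#   repeat_range 18504 32768 ps
```

Passing the bounds as arguments keeps the loop header fixed, so a typo in the separators cannot silently change how many processes are spawned.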
Running the simple bash loop above on a UP x86 box yields more spectacular results:

    printing eip:
    c01248d0
    *pde = 00000000
    Oops: 0002
    parport_pc lp parport autofs audit 3c59x sg scsi_mod microcode keybdev mousedev hid input usb-uhci usbcore ext3 jbd
    CPU: 0
    EIP: 0060:[<c01248d0>] Not tainted
    EFLAGS: 00010206
    EIP is at do_exit [kernel] 0x280 (2.4.21-14.EL/i686)
    eax: 00c0342d ebx: c14fd980 ecx: c14fd000 edx: 00000000
    esi: c1543140 edi: c3a1a524 ebp: c3476000 esp: c3477f98
    ds: 0068 es: 0068 ss: 0068
    Process ps (pid: 65536, stackpage=c3477000)
    Stack: c1548da8 c14fd980 00000000 c3476000 b75c4d54 bfffaa48 c0124a44 00000000
           c3476000 b75c4d54 c038c06f 00000000 00000000 b75c3260 b75c4d54 b75c4d54
           bfffaa48 000000fc 0000002b 0000002b 000000fc b75ebc32 00000023 00000246
    Call Trace: [<c0124a44>] do_group_exit [kernel] 0x54 (0xc3477fb0)
    Code: ff 48 10 8b 45 10 8b 40 24 83 48 14 08 8b 85 80 00 00 00 85
    Kernel panic: Fatal exception

Unless this is coincidentally a separate architecture-specific bug, the cause is probably in common code. I have been able to reproduce two instances of the do_exit() panic but not at exactly the same EIP.

----- Additional Comments From kaena.com 2004-04-23 09:12 -------
Mark as 'high' to track into U3.

----- Additional Comments From khoa.com 2004-07-19 22:58 -------
Sachin - can your team help on this bug ? Thanks.

----- Additional Comments From liuyan.com 2004-07-19 04:46 -------
Also tested it on RHEL3 U3 0709 iso, kernel 2.4.21-17.EL, while this defect has not been fixed. Any update? Thanks.

----- Additional Comments From prashanth_t.com 2004-07-21 07:30 -------
I am able to recreate the problem, though intermittently, on a ppc64 system. Looking at it further.

----- Additional Comments From markwiz.com 2004-07-21 13:07 -------
What kernel/ISO/or RPM are you running with?

----- Additional Comments From liuyan.com 2004-07-21 22:32 -------
This is first found on RHEL3 U2, kernel 2.4.21-13.EL. Also reproduced on RHEL3 U3 0709 iso, kernel 2.4.21-17.EL.
Thanks.

----- Additional Comments From prashanth_t.com 2004-07-22 07:05 -------
I came across a patch from Zhu on lkml which had a fix in alloc_pidmap. This might be a patch related to this bug. But I am not sure why this patch has not been included even in recent 2.6.7 kernels. Below is the link for the patch. Please apply this test patch and let us know the results.

http://seclists.org/lists/linux-kernel/2004/Jan/0931.html

----- Additional Comments From davidyao.com 2004-07-27 03:19 -------
Just rebuilt the kernel on RHEL3 U3 0720 with the recommended patch to test: the system no longer hangs, but still gets errors after the pid reaches 32k. Below is the test process:

1. Modify the /usr/src/linux-2.4/kernel/pid.c

    [root@plinuxt17 kernel]# diff -u pid.c.bak pid.c
    --- pid.c.bak 2004-07-23 18:59:58.000000000 +0800
    +++ pid.c 2004-07-23 19:01:54.000000000 +0800
    @@ -120,6 +120,8 @@
     }
     if (!offset || !atomic_read(&map->nr_free)) {
    +if (!offset)
    + map--;
     next_map:
     map = next_free_map(map, &max_steps);
     if (!map)

2. Rebuild kernel

3. Modify /proc/sys/kernel/pid_max from 32k to 40000

    [root@plinuxt17 kernel]# echo 40000 > /proc/sys/kernel/pid_max
    [root@plinuxt17 kernel]# cat /proc/sys/kernel/pid_max
    40000

4. Start test

    [root@plinuxt17 kernel]# for ((i=1567;i<32768;i++)); do ps; done
    ......
    32766 pts/1 00:00:00 ps
    PID TTY TIME CMD
    1493 pts/1 00:00:31 bash
    32767 pts/1 00:00:00 ps
    stat: Value too large for defined data type
    PID TTY TIME CMD
    stat: Value too large for defined data type
    PID TTY TIME CMD
    stat: Value too large for defined data type
    PID TTY TIME CMD
    stat: Value too large for defined data type
    PID TTY TIME CMD

----- Additional Comments From zhouwu.com 2004-07-27 06:54 -------
It seems that the above patch fixes the hang problem, but hits another problem in the "ps" utility. Judging from the output, the stat function returns -EOVERFLOW, which stands for "Value too large for defined data type".
While trying to determine the root cause, I found another interesting phenomenon: running the above process with an old version of ps (2.0.13) under the patched kernel does not output the stat error. But after the max pid reaches 32768, ps does not output anything:

    .....
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    PID TTY TIME CMD
    ......

You can log on to 9.181.24.49 (usr/pwd = root/plinux) to look into this. The current version of ps is 2.0.13, which I built from the procps source package. The original version of ps is 2.0.17, and is renamed to /bin/ps.orig. You can also find the 2.0.13 source under /usr/src/redhat/BUILD/procps-2.0.13/. Thanks.

----- Additional Comments From prashanth_t.com 2004-07-28 06:14 -------
Not only does 'ps' fail to list any processes, but all the processes with pids >32768 are not listed by the 'ps' command. But /proc has those pids with correct information. Since the 'stat' error gives some info to debug the problem with procps-2.0.17, I would like to have its sources on your system. I couldn't find the sources for this version. Can you please install the source rpm for procps from your CDs? I am downloading some of the ISO images for rhel3-u2 to get these sources.

----- Additional Comments From zhouwu.com 2004-07-28 21:50 -------
Sorry, I don't have the source code of procps-2.0.17 and don't know where to get it either. (If I could, I should have got this version instead of 2.0.13.) In the U3 Beta ISOs, there do not seem to be any source rpm packages. Maybe we could ask for RedHat's help about where to get the latest source rpm of procps.

----- Additional Comments From zhouwu.com 2004-07-28 21:52 -------
procps-2.0.13 is from RHEL3 Update2. I got it from RHN. Just FYI.

----- Additional Comments From prashanth_t.com 2004-07-30 06:38 -------
I tried with procps-2.0.17 (from fedora) and I could see the error from 'stat'.
Looking at the source code, 'ps', when executed with no options, returns 1 from table_accept( ) on satisfying certain conditions on euid/tty. I observe that when the process id is >32767, table_accept( ) returns 0. This was because the on_our_tty( ) condition failed, since cached_tty was '0' in that case. cached_tty is set after get_proc_stats( ), which in turn calls stat( ), from where the error code is seen. This needs to be debugged more, looking at stat( ). Since this bug looks to be different from the pid hang, I request you to open a new bug for this. I would continue working on this issue anyway.

----- Additional Comments From ssant.com 2004-08-03 00:06 -------
Changing the resolution to FIX ALREADY AVAILABLE. Will track the issue of ps not displaying output for pid's > 32767 in bug #10305. Thanks

----- Additional Comments From liuyan.com 2004-08-17 23:32 -------
We just installed the newest RHEL3-U3-re0813.1 ISOs, while the patch is still NOT included in the new ISOs.

    [root@plinuxt15 root]# uname -a
    Linux plinuxt15.cn.ibm.com 2.4.21-19.EL #1 SMP Thu Aug 12 23:21:44 EDT 2004 ppc64 ppc64 ppc64 GNU/Linux

By the way, who is responsible for submitting the patch? And when will RedHat plan to apply the patch in the kernel?

----- Additional Comments From khoa.com 2004-08-19 00:07 -------
This bug report has been mirrored to Red Hat, so Red Hat should be able to access this patch. I've put this patch on my list and will send it to Mark Wisner at Red Hat tomorrow for extra awareness.

----- Additional Comments From ssant.com 2004-08-24 00:10 -------
Any update on this from RH?

----- Additional Comments From liuyan.com 2004-09-06 03:35 -------
It seems Red Hat still did NOT include the patch in the RHEL3 U3 GM ISOs which were released on 09/04/2004. Thanks.

I have just posted a patch to our internal review mailing list for addressing this problem. Unless it meets with significant resistance, it (or some variation of it) will be incorporated into U4 within the next few days.

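For the separate 'ps' symptom discussed earlier in the thread (pids above 32767 present in /proc but missing from 'ps' output), one way to cross-check is to read the numeric directory names in /proc directly instead of relying on 'ps'. The following is a minimal sketch under stated assumptions: `pids_above` is a hypothetical helper name, bash is assumed, and the threshold is passed in by the caller because the comments mention both 32767 and 32768.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): print every purely
# numeric directory name under a proc-style directory whose value is
# greater than the given threshold, in ascending order.
pids_above() {
    local procdir=$1 threshold=$2 entry name
    for entry in "$procdir"/[0-9]*; do
        [ -d "$entry" ] || continue          # skip non-directories and unmatched globs
        name=${entry##*/}
        case $name in
            *[!0-9]*) continue ;;            # skip names that are not all digits
        esac
        if [ "$name" -gt "$threshold" ]; then
            printf '%s\n' "$name"
        fi
    done | sort -n
}

# Example: list pids above 32767 that are visible in /proc
#   pids_above /proc 32767
```

Any pid printed by this helper but absent from the 'ps' listing would match the behaviour described in the comments above.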
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.14.EL).

----- Additional Comments From liuyan.com 2004-11-03 02:52 EDT -------
I have tried this on RHEL3 U4 1020 isos, it can pass the test on a p630 and a Power5 SF4HV, while hung on another Power5 SF4HV. I will continue to investigate. Thanks.

    [root@plinuxt20 kernel]# cat /etc/issue
    Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
    Kernel \r on an \m
    [root@plinuxt20 kernel]# uname -r
    2.4.21-21.EL
    [root@plinuxt20 kernel]# cat /proc/sys/kernel/pid_max
    32768
    [root@plinuxt20 kernel]# echo 40000 > /proc/sys/kernel/pid_max
    [root@plinuxt20 kernel]# cat /proc/sys/kernel/pid_max
    40000
    [root@plinuxt20 kernel]# for ((i=304;i<40005;i++)); do ps; done
    ...
    14705 pts/0 00:00:43 bash
    39998 pts/0 00:00:00 ps
    PID TTY TIME CMD
    14705 pts/0 00:00:43 bash
    39999 pts/0 00:00:00 ps
    PID TTY TIME CMD
    14705 pts/0 00:00:43 bash
    300 pts/0 00:00:00 ps
    PID TTY TIME CMD
    ...

----- Additional Comments From liuyan.com 2004-11-05 04:08 EDT -------
The hang on the SF4HV is more likely a hmc vterm problem. I tried several ways to test this bug, it all passed on RHEL3 U4 1020 isos. Thanks.

----- Additional Comments From liuyan.com 2004-11-05 04:09 EDT -------
Close it. Thanks.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html
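
The passing run above shows the pid sequence wrapping from 39999 back to 300 once pid_max is raised to 40000. A small filter can spot that wrap point automatically in a long run. This is a sketch only: `first_wrap` is a hypothetical name, bash is assumed, and the input is assumed to be one pid per line in the order the processes were created.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): read one pid per
# line on stdin and report the first place where the pid decreases,
# i.e. where the kernel wrapped around after reaching pid_max.
# Prints "PREV -> NEXT" and returns 0 on a wrap, returns 1 if none is seen.
first_wrap() {
    local prev="" pid
    while read -r pid; do
        if [ -n "$prev" ] && [ "$pid" -lt "$prev" ]; then
            printf '%s -> %s\n' "$prev" "$pid"
            return 0
        fi
        prev=$pid
    done
    return 1
}

# Example with the pids from the transcript above:
#   printf '39998\n39999\n300\n' | first_wrap
```

If the loop finishes without the filter reporting a wrap, the run did not spawn enough processes to reach the configured pid_max.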