Description of problem:
We have two *production* servers (1, 2) running RHEL AS3 Update 1 with all errata packages installed. Server 1 was running RHL 7.3 until Dec. 30, 2003, and server 2 was running RHL 9 until March 25, 2004; both servers were stable. Since the upgrade to RHEL, both servers have been suffering periodic hangs with different intervals of stability but with the same symptoms:
+ The server stops responding on all services with no apparent degradation. It still answers ping requests and responds to SysRq.
+ We can showMem, showPc, showTasks, shoWcpus.
+ If we tErm or kIll, only a few processes die.
+ A request to Unmount can't remount all filesystems R/O and always stops remounting at the same device (on both servers this is a +64GB ext3 filesystem with quotas enabled).
+ Finally we have to reBoot.

Version-Release number of selected component (if applicable):
Currently 2.4.21-9.0.3.ELsmp, but same behaviour with 2.4.21-9.0.1.ELsmp

How reproducible: n/a

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
(1) n400:
+ Fujitsu-Siemens PRIMERGY N400, 4 x Pentium III Xeon 700MHz
+ 4 GB RAM
+ 4 x 36GB internal, SW RAID (md)
+ 8 x 36GB external, HW RAID on a Mylex AcceleRAID 352
+ Using LVM as volume manager, ext3 as filesystem with quotas enabled on users' filesystems.
(2) dl580:
+ HP ProLiant DL580 G2, 4 x Intel Xeon MP 2GHz
+ 8GB RAM
+ 4 x 73GB internal HD, HW RAID on a SA 5i+
+ 8 x 146GB external HD, HW RAID on a SACS connected to a SA 532
+ Using LVM as volume manager, ext3 as filesystem with quotas enabled on users' filesystems.
Created attachment 99813 [details] Server 1 (n400) freeze #3 Console output for SysRq request of frozen server 1. We don't have the output associated to the previous two freezes.
Created attachment 99814 [details] Server 2 (dl580) freeze #4 Console output for SysRq request of frozen server 2.
Created attachment 99815 [details] Server 2 (dl580) freeze #5 Console output for SysRq request of frozen server 2.
Created attachment 99816 [details] Server 2 (dl580) freeze #7 Console output for SysRq request of frozen server 2. Note: on this console output showTasks has been called twice, before and after trying to kill tasks (tErm, kIll).
OK, looking at it some more I see a potential problem:
1) irqbalance opens /proc/interrupts, which ends up doing a GFP_KERNEL allocation from interrupts_open()
2) kswapd dives into the filesystem code to free memory (in order to satisfy the allocation)
Al, could this result in a locking problem? Would it be better if interrupts_open() did its allocation with GFP_NOFS?
What locking problem? Caller of interrupts_open() is not holding any locks; for all practical purposes we are talking about allocation in sys_open() and it's _definitely_ allowed to make GFP_KERNEL allocations.
Al, thanks for confirming that that's not the issue here. I'll take another look at the traces to see if there's anything else suspicious...
Juanjo, can you get the RHEL3-U2 (update 2) kernel and try it? I think this problem has been fixed. In the meantime, please try
    echo 30 > /proc/sys/vm/inactive_clean_percent
and re-run the workload.
Larry Woodman
Larry, as far as I know RHEL3-U2 is still beta, and kernel-smp-2.4.21-14.EL.i686.rpm is vulnerable to RHSA-2004:183. As these servers allow interactive user access (students, faculty staff ...) we can't run a vulnerable kernel. If all we need from the RHEL3-U2 kernel is the new "vm.inactive_clean_percent = 30" default, we can put this value in sysctl.conf until RHEL3-U2 is released or a patched kernel is available.
Anyway, I configured "vm.inactive_clean_percent = 30" last Friday on both servers. Unfortunately, last Saturday server 2 froze (*), but the scenario changed slightly, maybe due to the new inactive_clean_percent setting:
+ The server stops responding on all services with no apparent degradation. It still answers ping requests and responds to sysrq, and agetty on the serial console is able to spawn login, so I could nearly log in as root (user/password accepted, motd shown, but no shell prompt).
+ We can showMem, showPc, showTasks, shoWcpus.
+ tErm was able to kill all user processes.
+ Then we logged in as root, took some additional info and rebooted the server. Unfortunately the server got stuck on an rc script; we issued a tErm again and finally the server froze completely (no ping, no sysrq ... nothing).
(*) Since the server 2 upgrade, it has not been up for more than 6 days (indeed most freezes take exactly 6 days to appear), and it was last restarted on April 25th.
Created attachment 99899 [details] Server 2 (dl580) freeze #8 (with comments) Server 2 (dl580) freeze #8 grep "^COMMENT:" for added comments to the console output
Created attachment 99907 [details] Server 2 (dl580) memory usage until freeze #8
This is a modified "sar -r" output where the last column computes real memory usage (in MB) as:
    memused - buffers - cached
You'll see how this value increases from 04/25/2004 09:50:00 AM (server reboot) until 05/01/2004 07:57:13 PM (server freeze #8). We think this memory consumption does not correspond to user process usage, because this server has a regular load due to e-mail (SMTP, POP and IMAP) and a variable load (interactive sessions, database sessions, smb disk access ...) that starts at 8:00 and ends at 22:00.
See also the "free" and "ps -alfy" output in attachment "Server 2 (dl580) freeze #8 (with comments)", where the sum of the RSS of all processes is far below the real memory usage. We guess the kernel is using this memory, but it seems like too much memory usage to us.
Please note that "vm.inactive_clean_percent = 30" was set on 04/30/2004 at approximately 10:30:00 PM. We also have full sar statistics if more information is required.
Created attachment 99943 [details] showMem for loopback interface troubles
Comment on attachment 99943 [details] showMem for loopback interface troubles
On April 22nd we detected problems with CUPS and SMTP (postfix+amavis sandwich). At first glance they seemed like two independent issues, although both services use 127.0.0.1 for inter-process communication (lp-cupsd, postfix-amavis).
After some testing we found that cups could print files <15KB (approx.), but 'lp' always locked up when trying to print bigger files. Stracing 'lp', we noticed it was blocked receiving from 'cups' just after sending the file being printed (see the strace output in the next attachment). As the default MTU for interface 'lo' is 16436, a 15KB file fits in a single packet and a 16KB file (plus TCP/IP headers) doesn't, so we tested with different values and noticed that lowering the lo MTU to 1500 solved the problem.
We don't know if this problem is related to the server freezes, but today it has shown up again on server 2 (it has never happened on server 1) and, again, setting the lo MTU to 1500 has worked. We haven't rebooted the server, so if more testing is needed or if a new bug has to be opened for this issue, please let us know.
Regards.
Created attachment 99949 [details] 'lp' strace output for loopback interface troubles
Juanjo, what is running on this system? The reason I ask is because the memory allocation for lowmem looks rather strange: all of the lowmem has been allocated to the pagecache and none has been allocated to anonymous memory regions. This is very unusual.

( Active: 440307/257888, inactive_laundry: 38729, inactive_clean: 38827, free: 1161926 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2942
aa:0 ac:109614 id:20627 il:3071 ic:3134 fr:1440
aa:62489 ac:268204 id:237261 il:35658 ic:35693 fr:1157544

Larry Woodman
Ah!!! I know what the problem is here, and I already fixed it in RHEL3-U2. We were not properly casting to an unsigned char within an if statement, and that resulted in this exact system hang. We need to get you running RHEL3-U2 asap!
Larry Woodman

******** patch to rebalance_laundry_zone that fixed the hang *********
-if (now - page->age > 30) {
+if ((unsigned char)(now - page->age) > 30) {
********************** hang traceback without this fix ***************
[<c0133db5>] schedule_timeout [kernel] 0x65 (0xc9083e38)
[<c0133d40>] process_timeout [kernel] 0x0 (0xc9083e58)
[<c0145b88>] wait_on_page_timeout [kernel] 0xa8 (0xc9083e70)
[<c0165b07>] try_to_free_buffers [kernel] 0x147 (0xc9083e94)
[<c0152ad8>] rebalance_laundry_zone [kernel] 0x218 (0xc9083eac)
[<c0155b1d>] __alloc_pages [kernel] 0x28d (0xc9083edc)
[<c0155c5c>] __get_free_pages [kernel] 0x1c (0xc9083f20)
[<c0125adf>] dup_task_struct [kernel] 0x5f (0xc9083f24)
[<c012642b>] copy_process [kernel] 0x7b (0xc9083f38)
[<c021da30>] sock_map_fd [kernel] 0x70 (0xc9083f40)
[<c0126f4e>] do_fork [kernel] 0x4e (0xc9083f68)
[<c0109d09>] sys_clone [kernel] 0x49 (0xc9083fa0)
Hi Larry, is there a chance to patch our current kernel (2.4.21-9.0.3.ELsmp) with rebalance_laundry_zone patch, or to patch RHEL3-U2 (kernel-smp-2.4.21-14.EL) with ip_setsockopt security patch? Both issues are very important for us. Thanks.
The very latest RHEL3-U2 kernel contains both patches you need. Larry
Hello, Juanjo. I just want to confirm that the latest RHEL3 U2 kernel, which is version 2.4.21-15.EL, contains the same security errata fix to ip_setsockopt() that was released in RHSA-2004:183 (kernel version 2.4.21-9.0.3.EL), which you refer to in comments #10 and #18. We intend to officially release the RHEL3 U2 errata next week, and its advisory id is RHSA-2004:188. (It's currently at the end of the external beta testing period.) Cheers. -ernie
Hi Larry, Ernie. Thanks for your fast response, but I am unable to find either kernel 2.4.21-15.EL or advisory RHSA-2004:188 at rhn.redhat.com. The current kernel on "Red Hat Enterprise Linux AS (v. 3 for x86) Beta" is 2.4.21-14.EL. I guess we will have to wait until you officially release the RHEL3 U2 errata. Is this true? Best regards, Juanjo.
Hi, Juanjo. I have just verified with our RHN team that the -15.EL kernel (respun 2 weeks ago for the security issue) was intentionally not pushed into the beta channel (for obscure process reasons). But it is scheduled to be pushed into the main update channel on Monday, after which time you should be able to upgrade via RHSA-2004:188. If you are unable to access RHSA-2004:188 by Tuesday, May 11th, please feel free to contact me for a status update. Cheers. -ernie
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html
Created attachment 100368 [details] Server 2 (dl580) freeze #9 with latest kernel
Hi,
We are reopening this bug because, after upgrading our servers to RHEL AS 3 Update 2, we have again experienced the problems that led us to open it. Server 2 (dl580) had been running 2.4.21-15.ELsmp since May 13th and froze after 7 days of uptime (note that 7 days was the usual uptime between freezes with the previous kernel) with the same symptoms (it only responds to ping and sysrq). You will find attached the console output for the sysrq requests (Memory, Tasks, Umount ...).
As we noted previously, this server was running RHL 9 until March 25, 2004. I don't know if this is relevant, but with RHL9 we experienced similar periodic freezes (in that case we were able to kill all tasks from sysrq, log in to the system and reboot), and the workaround was to use a PAE-disabled kernel, so we switched from kernel-bigmem to kernel-smp and got a rock-solid server at the expense of losing 4GB.
With RHEL 3, kernel-smp already has PAE enabled, so we can't test the same workaround, but we are thinking about modifying kernel-2.4.21-15.EL.src.rpm to generate an additional kernel-smp4g package with CONFIG_HIGHMEM4G=y instead of CONFIG_HIGHMEM64G=y. What do you think about that?
Best regards
This is not a memory problem, none of the zones are all that badly depleted. Please get me several AltSysrq T, P, W and M outputs when the system is in this state so I can see exactly what the processes are stuck on. Larry Woodman
Created attachment 101240 [details] Server 2 (dl580) freeze #10 with latest kernel Console output for SysRq request of frozen server 2 (05/25/2004). The server had 5 days of uptime.
Created attachment 101241 [details] Server 1 (n400) freeze #4
Server 1 (n400) freeze #4 with kernel 2.4.21-15.ELsmp. Console output for SysRq request of frozen server 1 (06/14/2004). The server had 27 days of uptime.
Hi Larry, I have attached the console output for the two last freezes. As usual, you will find AltSysrq T, P, W and M before and after AltSysrq E and I. Please tell me if this is not what you need. Regards, Juanjo
Created attachment 101243 [details] Server 2 (dl580) freeze #11 with PAE-disabled kernel
Console output for SysRq request of frozen server 2 (06/18/2004). You will find two AltSysrq T, P, W and M before AltSysrq I, and two after. The server had 2 days of uptime and was running a modified version of 2.4.21-15.ELsmp with CONFIG_HIGHMEM4G=y. It had previously run this kernel for 18 days with no problems (it was rebooted for maintenance reasons). Prior to the freeze, the server had a load average of 24.318. Now the server is running the latest errata kernel (2.4.21-15.0.2.ELsmp, not modified).
I am having the same problems with RH9.0 kernel 2.4.20-31.9 bigmem. Running on Compaq DL380's. Seems to occur when we get high ftp uploads or high user processing. papaz
Created attachment 101370 [details] Server 2 (dl580) freeze #12 with latest kernel Console output for SysRq request of frozen server 2 (06/24/2004). You will find several AltSysrq T, P, W and M. The server had 6 days of uptime and was running kernel 2.4.21-15.0.2.ELsmp.
Hi, unfortunately I have a similar case: two Compaq Proliant ML-350 G3 with Smart Array 64xx (mirrored disks), under Linux Enterprise Server 3.0. Our systems freeze at intervals of 2, 8, ? days, for no apparent reason. I do not run the "hpasm" daemons, only the required services (smtp, xinetd, ssh, ftp, poppassd). I upgraded to kernel 2.4.21-15.0.2.ELsmp, the latest released at http://www.redhat.com/security/ , and upgraded the ROM flash components; nevertheless the problem persists (excuse my bad English). At the moment we are in contact with Red Hat looking for a solution. I would be grateful for any comments. Thanks, César.
I have several of the HP DL380G3 boxes with 6G RAM. These boxes seem to hang with a PAE-enabled kernel (2.4.18-26.7.xbigmem, or custom 2.6.7). They run fine with 2.4.18-26.7.xsmp or 2.4.21-15.ELsmp (which limits the box to 4G of physical RAM usable by the kernel). I can usually trigger a hang in 5-20 minutes by running this in a kernel tree:
    while true; do make clean; make -j4 bzImage; done
Created attachment 102132 [details] Server 2 (dl580) freeze #13 with kernel 2.4.21-15.0.2
Hi Larry,
Please find attached the console output for SysRq request of frozen server 2 (07/22/2004). The server had 30 days of uptime and was running kernel 2.4.21-15.0.2.ELsmp. Note that this time the server had been up for a month; this may be due to the fact that students are on vacation and the server has a lower load.
Did the latest console outputs, with several AltSysrq, help you to see what the processes are stuck on? If you need more info please tell us.
Best regards,
Juanjo
Correction: In #35 I stated that our DL380G3 boxes ran fine with 2.4.21-15.ELsmp. I have since discovered that they also hang running this kernel. It takes 2-10 hours to hang, but the box will almost always go down in less than 1 day.
I have had 8 crashes (and reboots from the HP PSP watchdog) in the last 12 hours.
OS: RHEL3.0U2
HW: HP DL380G3
I caught one of the oops/panic reports before the watchdog rebooted the box...

kernel BUG at journal.c:406!
invalid operand: 0000
soundcore cpqasm cpqevt lp parport autofs audit 8021q bcm5700 floppy sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU: 3
EIP: 0060:[<f884b9fa>] Tainted: P
EFLAGS: 00010286
EIP is at journal_write_metadata_buffer [jbd] 0x38a (2.4.21-15.ELsmp/i686)
eax: 00000068 ebx: f4347390 ecx: 00000001 edx: c037ae94
esi: 00000000 edi: f7ec3540 ebp: 0000000d esp: f6071e38
ds: 0068 es: 0068 ss: 0068
Process kjournald (pid: 23, stackpage=f6071000)
Stack: f884f17c f884dfc4 f884dffe 00000196 f884dfe2 f884bbe8 00000000 00000000
       f4347390 00000000 f7ec3540 0000000d f8848ad9 f2cab500 f4347390 f6071e98
       00001d69 f60701c0 f6ba0a94 00000000 00000f64 f2c2009c 00000011 f2cab500
Call Trace:
[<f884f17c>] .rodata.str1.4 [jbd] 0xf48 (0xf6071e38)
[<f884dfc4>] .rodata.str1.1 [jbd] 0x544 (0xf6071e3c)
[<f884dffe>] .rodata.str1.1 [jbd] 0x57e (0xf6071e40)
[<f884dfe2>] .rodata.str1.1 [jbd] 0x562 (0xf6071e48)
[<f884bbe8>] journal_next_log_block [jbd] 0x48 (0xf6071e4c)
[<f8848ad9>] journal_commit_transaction [jbd] 0xed9 (0xf6071e68)
[<c0109c6c>] __switch_to [kernel] 0x16c (0xf6071ef8)
[<c0125194>] context_switch [kernel] 0xa4 (0xf6071f20)
[<c0123274>] schedule [kernel] 0x2f4 (0xf6071f3c)
[<f884b51a>] kjournald [jbd] 0x17a (0xf6071fb0)
[<f884b380>] commit_timeout [jbd] 0x0 (0xf6071fd4)
[<f884b3a0>] kjournald [jbd] 0x0 (0xf6071fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf6071ff0)
Code: 0f 0b 96 01 fe df 84 f8 e9 a9 fc ff ff 89 f6 8d bc 27 00 00
Kernel panic: Fatal exception
Unable to handle kernel NULL pointer dereference at virtual address 00000008
printing eip:
c0142e10
*pde = 31b88001
*pte = 00000000
We have a box with:
Brand: HP
Model: DL580 G2
CPU: 4 x Intel Xeon 2GHz
Memory: 6GB
Disks: 4 x 140GB, SmartArray 642 Controller configured Array 5
Operating System: RedHat Enterprise Linux AS Version 3 with 2.4.21-15.0.3.ELsmp kernel
Main Application: Oracle Database Server version Oracle9i Release 9.2.0.5.0

The server has been running since June 8, 2004 and we have had many freezes (almost one per day). This bug (Bug # 122077) seems to be the same problem we have. All updates recommended by Red Hat Network were applied; however, it still freezes and only responds to ping. If you need more information I can send it!

Almir Alcides Pollnow
Network Administrator
TEKA S.A.
++55 47 3215132
Brazil
I see what the problem is here. One process ends up calling getdqbuf() indirectly via open when disk quotas are in use. getdqbuf() downs the dqio_sem semaphore and attempts to allocate a dqbuf. The allocation calls wakeup_kswapd() and blocks because the system is very low on memory. kswapd then wakes up, calls dqput() indirectly through prune_icache and downs the same dqio_sem semaphore. At this time the system is deadlocked! This patch will prevent wakeup_kswapd from blocking, therefore the system will not deadlock.

********************************************************************
--- linux-2.4.21/fs/quota_v2.c.orig 2004-07-30 13:31:56.000000000 -0400
+++ linux-2.4.21/fs/quota_v2.c 2004-07-30 13:32:12.000000000 -0400
@@ -128,7 +128,7 @@
 static dqbuf_t getdqbuf(void)
 {
-       dqbuf_t buf = kmalloc(V2_DQBLKSIZE, GFP_KERNEL);
+       dqbuf_t buf = kmalloc(V2_DQBLKSIZE, GFP_NOFS);
        if (!buf)
                printk(KERN_WARNING "VFS: Not enough memory for quota buffers.\n");
        return buf;
**********************************************************************

The kernel with this fix can be downloaded from here: http://people.redhat.com/~lwoodman/.RHEL3/
Please test out the kernel and let me know how it goes.
Larry Woodman
Hi Larry, I have reviewed the attached SysRq console outputs and I have found the scenario you describe (a client process ---local, useradd, imap, quota, etc--- calling getdqbuf() and kswapd calling dqput()) in almost all (*) server freezes. I will update both servers to 2.4.21-18.dq this week, but the load on these servers will be very low until mid September and, with this load, we have had only one freeze in the past 30 days ... Best regards, Juanjo (*) As stated in comment #17, "Server 2 (dl580) freeze #8" was a different problem already fixed in U2.
Created attachment 102382 [details] Server 2 (dl580) freeze #14 with kernel 2.4.21-15.0.3
Hi Larry,
Please find attached the console output for SysRq request of frozen server 2 (08/03/2004). The server had 11 days of uptime and was running kernel 2.4.21-15.0.3.ELsmp. I have found the getdqbuf() / dqput() scenario in the SysRq task list. Both servers are now running kernel 2.4.21-18.dq.ELsmp.
Larry, can you make the associated "kernel-source" and "kernel-hugemem" packages available? I need the first one to recompile the fujitsu-siemens agents for server 1, and the second package to test it on an NFS client affected by bugzilla #118839 ... if this is not possible I will try to patch kernel-2.4.21-18.EL, available from the "Red Hat Enterprise Linux AS (v. 3 for x86) Beta" channel.
Best regards,
Juanjo
The builds you requested are underway, I'll put them in my people page location as soon as they are complete and update this bug. Larry
All set Juanjo, however I moved the kernels for you to avoid any confusion. Please grab them from here: http://people.redhat.com/~lwoodman/.bug122077/ Larry Woodman
Created attachment 102464 [details] Server 2 (dl580) freeze #14 with kernel 2.4.21-18.dq
Hi Larry,
Please find attached the console output for SysRq request of frozen server 2 (08/05/2004). The server had 2 days of uptime and was running kernel 2.4.21-18.dq.ELsmp. This freeze is very different from the getdqbuf-related freezes: existing interactive sessions appeared to be responsive but were unable to execute any command, SysRq tErm killed some processes, SysRq Unmount successfully remounted all filesystems, and almost all tasks look like this:

Call Trace:
[<c0123e14>] schedule [kernel] 0x2f4 (0xf5959e48)
[<c010adb3>] __down [kernel] 0x73 (0xf5959e8c)
[<c010af5c>] __down_failed [kernel] 0x8 (0xf5959ec0)
[<c0175c86>] .text.lock.namei [kernel] 0x35 (0xf5959ed0)
[<c017204c>] link_path_walk [kernel] 0x45c (0xf5959ef0)
[<c0172599>] path_lookup [kernel] 0x39 (0xf5959f30)
[<c0172b5e>] open_namei [kernel] 0x7e (0xf5959f40)
[<c0162333>] filp_open [kernel] 0x43 (0xf5959f70)
[<c0162763>] sys_open [kernel] 0x53 (0xf5959fa8)

I have switched this server to the latest stable errata kernel, 2.4.21-15.0.4.ELsmp, and, if you think it is a good idea, I will try to patch this kernel with your .dq. patch and update both servers to this patched kernel.
Best regards,
Juanjo
Juanjo, can you let the system get back into the above state and get me one AltSysrq-T followed by one AltSysrq-M? There is so much "stuff" in the attachment that I am having trouble determining which tracebacks are in the same AltSysrq-T and which are duplicates. Thanks, Larry
Larry, I don't see the problem ...
... in the attached file there are other AltSysrq than "T" and "M", but you can find:
+ an AltSysrq-M on lines 84 to 107
+ an AltSysrq-T on lines 163 to 4208 (I have counted 393 tasks; note that there are a lot of "chkpwdd" because it is a daemon that forks to process each incoming request, and the same applies to "xinetd" ---used for pop[s] and imap[s]---)
+ an AltSysrq-E on line 4211
+ another AltSysrq-M on lines 4214 to 4236
+ another AltSysrq-T on lines 4303 to 7579 (now there are only 294 tasks and 236 of them have "link_path_walk" in their "Call Trace"; I don't know if it is related to the hang)
If you want, I can attach those AltSysrq outputs as independent files ...
Regarding getting the system back into this state, please tell me if it is absolutely necessary; remember that this is a production server (I have switched to 2.4.21-15.0.4 because it is a security errata kernel).
Regards,
Juanjo
Hi, Larry, Have you put the related kernel source on your people page? I need it to build the lpfcdd driver. If you do, it will be a great help. Thanks! Calvin
Sorry for the delay on this Calvin, I thought I did put the src.rpm there! It's there now; please let me know how this works for you. I need confirmation that it does fix your problem. Now Juanjo is hitting another deadlock which I haven't figured out yet. Please let me know if you hit that one as well. http://people.redhat.com/~lwoodman/.bug122077/kernel-2.4.21-18.dq.EL.src.rpm Thanks, Larry Woodman
Created attachment 102749 [details] Server 1 (n400) freeze #5 with kernel 2.4.21-15.0.4.dq Hi Larry, Please find attached the console output for SysRq requests (T+M+U+B) of frozen server 1 (08/14/2004). Due to the release of security errata kernel 2.4.21-15.0.4, I switched both servers to this kernel, then I patched this kernel with your 'dq' patch and switched both servers to 2.4.21-15.0.4.dq.ELsmp. Server 1 was hit by a deadlock after 3 days of uptime. Best regards, Juanjo
Larry, I applied the source code from http://people.redhat.com/~lwoodman/.bug122077/kernel-2.4.21-18.dq.EL.src.rpm, and built modules from that source base. When I tried to install the modules, it came back with a kernel version mismatch error and failed to install. Any suggestions? (My kernel is 2.4.21-18.dq.ELsmp) Thanks! Best Regards, Calvin
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.4.EL).
Hi All,
Our system: DL380G3 (latest firmware) + RHEL ES3-U3 with kernel 2.4.21-20.ELsmp
Periodic freezes, every 12-18 days, with all previous versions of RHEL: only ping works and there are no error messages...
This morning I was able to freeze the system by hand (many times) using memtester: http://www.qcc.ca/~charlesc/software/memtester/
I use memtester from DAG: http://dag.wieers.com/packages/memtester/
Command used (we have 2.5GB RAM):
# memtester 2310
Memtester tries to mlock() memory and then the system freezes...
I hope this helps.
Hi Ernie, I can't find kernel version 2.4.21-20.4.EL in RHN. Could you make it available? I am very interested in it because it may save me from patching 2.4.21-20.EL with the dq patch ... Best regards, Juanjo
Hello, Juanjo. The 2.4.21-20.4.EL kernel will never be available via RHN because it is an internal-to-Red-Hat Engineering build. The fix will eventually be released as part of Update 4, which is currently anticipated in December of this year.
Created attachment 103903 [details] Server 2 (dl580) freeze #16 with kernel 2.4.21-20.dq.EL
Hi Larry,
This morning we updated server 2 to RHEL3 U3 with a 2.4.21-20 kernel with the dq patch. After 4 hours of uptime we detected the following problems on the server:
+ Very high load average (80 when we were alerted about the problem).
+ Unable to open new ssh/console sessions.
+ On the existing sessions, some commands worked fine, but others (like top, ps and w) ran very slowly, though they were interruptible via ctrl+c.
Please find attached the AltSysrq-T and AltSysrq-M console output at the moment we detected the problems.
Stopping almost all services didn't help to recover the server, so we tried to "reboot", without success. We then issued an AltSysrq-U (all filesystems were successfully remounted R/O) followed by an AltSysrq-B to successfully reboot the server (I can attach the full console output if it helps). We have rebooted the server with the standard 2.4.21-20.EL kernel.
Regards,
Juanjo
Hi Juanjo, the problem is that several of the processes are blocked in wakeup_kswapd() when they shouldn't be. This is due to a bug we discovered in wakeup_kswapd() that unfortunately wasn't discovered until RHEL3-U3 was out. Can you please try the appropriate kernel with the fix? It's located in: http://people.redhat.com/~lwoodman/.RHEL3/ Thanks, Larry Woodman
Hi Larry, Could you also provide the .src.rpm? We also need to patch it with your 'dq' patch (if it isn't already included in 2.4.21-20.6)... Regards, Juanjo
Sorry Larry, I forgot to ask you if this bug is related to the amount of RAM installed on the server, we have other servers with 4GB or less running 2.4.21-20 for several days without problems ... Should we upgrade these servers to 2.4.21-20.6 too? Regards, Juanjo
Reverting to MODIFIED state, since the bug fix is in the U4 pool.
Created attachment 104030 [details] Server 2 (dl580) panic #1 with kernel 2.4.21-20.6.EL Hi Larry, After 36 hours of uptime with kernel 2.4.21-20.6 we have got the attached panic on server 2. We have switched to 2.4.21-15.0.4.dq. Regards, Juanjo
Hello, Juanjo. Please open up a new bugzilla for the oops you just documented in comment #62. This bug has already been used to track two separate problems (one fixed in U3, and the other committed to U4). Thanks. (Note to Larry: the changes you made to do_try_to_free_pages() were originally committed to -20.6.EL, and later updated in -20.7.EL, but I don't know whether this might be related to this latest kscand oops.)
Juanjo, I already fixed this panic. Please grab the appropriate kernel from here and rerun the test ASAP: http://people.redhat.com/~lwoodman/.RHEL3/ Thanks, Larry
Reverting to MODIFIED state, again.
Hi Ernie, I have opened Bug #133971 to document the kernel oops. Regards, Juanjo
Thanks, Juanjo. Larry is on the case.
Created attachment 104509 [details] Server 1 (n400) freeze #6 with kernel 2.4.21-20.dq.EL
Created attachment 104510 [details] Server 1 (n400) freeze #7 with kernel 2.4.21-20.dq.EL
Hi Larry, Please find attached in Comments #68 & #69 the console output for SysRq requests (T+M) of frozen server 1 (on 09/27/2004 and 09/28/2004). The scenario was similar to the one documented in Comment #56, but I attached them in order to be sure it isn't a different issue. Best regards, Juanjo
Hi Larry,
I have 2 servers:
HP DL380 with 2 x Intel Xeon 3GHz processors
6GB of memory
RH3-Up3 (2.4.21-20.ELsmp)
Storage EMC Clariion CX700 using Qlogic QLA2340 fibre
Software: Oracle RAC (9.2.0.5.0), Oracle 9i (9.2.0.5.0), Oracle Collaboration Suite 9.0.4.1.0
I have the same symptom of a hang after 20 hours running: ping OK, sysrq OK, but no login and no log messages. Is there any solution to the problem?
Thanks,
Paulo Vilhena
Juanjo, I believe the "freeze" problem that you were seeing was fixed in RHEL3-U4, specifically in kernel-2.4.21-20.6.EL, with the patch for the incorrect blocking bug inside wakeup_kswapd(). Can you confirm this so we can close out this bug?
Paulo, as far as your freeze after 20 hours is concerned, the most likely cause is the "Storage EMC Clariion CX700 using Qlogic QLA2340 fibre". This has caused multiple unrelated hangs on other systems. Can you grab me AltSysrq-M, AltSysrq-W and AltSysrq-T outputs when the system gets into that state so I can verify my suspicions?
Larry Woodman
Hi Larry, Both servers are running 2.4.21-25.EL from U4 Beta and all the problems reported on this bug seem solved, so you can close it. Best regards, Juanjo
Larry, I'm trying to generate the requested information for you. Thanks, Paulo Vilhena
Larry, One question... If the problem is "Clariion+QLA2340", what do you suggest? I'm using the native RH driver, not the EMC driver. Thanks, Paulo Vilhena
Where can I get the 2.4.21-25.EL kernel from U4 Beta that Juanjo says works for him? These server freezes are getting out of hand. Thanks. EZ
It's in the arch-dependent beta channel on RHN, e.g., the one named "Red Hat Enterprise Linux AS (v. 3 for x86) Beta". We anticipate that U4 will be released next week. The final kernel version is 2.4.21-27.EL.
Thank you Ernie. I think we will wait until next week for U4 to come out and give that a shot. EZ
Created attachment 108309 [details] Alt+SysRq logs of Oberon server - Paulo Vilhena
Hi Larry, On December 1 you asked me for the Alt+SysRq logs, to verify whether the freeze of my server is caused by "EMC Clariion+QLA2340". The logs are uploaded. Thanks, Paulo Vilhena
Hi Larry,
I have Red Hat Enterprise Linux AS release 3 (Taroon Update 3), kernel 2.4.21-26.EL on an i686, and my server hangs with different intervals of stability but with the same symptoms:
- all daemons die and only the kernel is up (ping, SysRq, ...)
This problem happens (95% of the time) when I send an email to the mailing list (the server has postfix + mailman and all members are local).
I attach SysRq (w,t,m) logs.
Regards
Created attachment 108631 [details] Sysrq Logs Sysrq logs (w,t,m)
Created attachment 108699 [details] top top logs
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html