Bug 122077

Summary:

servers freeze (only respond to ping and sysrq) periodically

Product:

Red Hat Enterprise Linux 3

Reporter:

Juanjo Villaplana <villapla>

Component:

kernel

Assignee:

Larry Woodman <lwoodman>

Status:

CLOSED ERRATA

Severity:

high

Priority:

medium

Version:

3.0

CC:

aap, avi, aviro, cgomez, enrico, ewhiting, k.georgiou, papaz, petrides, riel, sscchuang, tao, traverj

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2004-12-20 20:55:04 UTC

Attachments:

Description	Flags
Server 1 (n400) freeze #3	none
Server 2 (dl580) freeze #4	none
Server 2 (dl580) freeze #5	none
Server 2 (dl580) freeze #7	none
Server 2 (dl580) freeze #8 (with comments)	none
Server 2 (dl580) memory usage until freeze #8	none
showMem for loopback interface troubles	none
'lp' strace output for loopback interface troubles	none
Server 2 (dl580) freeze #9 with latest kernel	none
Server 2 (dl580) freeze #10 with latest kernel	none
Server 1 (n400) freeze #4	none
Server 2 (dl580) freeze #11 with PAE-disabled kernel	none
Server 2 (dl580) freeze #12 with latest kernel	none
Server 2 (dl580) freeze #13 with kernel 2.4.21-15.0.2	none
Server 2 (dl580) freeze #14 with kernel 2.4.21-15.0.3	none
Server 2 (dl580) freeze #14 with kernel 2.4.21-18.dq	none
Server 1 (n400) freeze #5 with kernel 2.4.21-15.0.4.dq	none
Server 2 (dl580) freeze #16 with kernel 2.4.21-20.dq.EL	none
Server 2 (dl580) panic #1 with kernel 2.4.21-20.6.EL	none
Server 1 (n400) freeze #6 with kernel 2.4.21-20.dq.EL	none
Server 1 (n400) freeze #7 with kernel 2.4.21-20.dq.EL	none
Alt+SysRq logs of Oberon server - Paulo Vilhena	none
Sysrq Logs	none
top	none

Description Juanjo Villaplana 2004-04-30 08:45:16 UTC

Description of problem:

We have two *production* servers (1,2) running RHEL AS3 Update1 with
all errata packages installed.

Server 1 was running RHL 7.3 until dec. 30 2003, and server 2 was
running RHL 9 until march 25 2004, both servers were stable.

Since the upgrade to RHEL both servers have been suffering periodical
hangs with different intervals of stability but with the same symptoms:

+ The server stops responding to all services with no apparent
degradation. It still returns ping requests and responds to sysrq.

+ We can showMem, showPc, showTasks, shoWcpus

+ If we tErm or kIll only a few processes die

+ A request to Unmount can't Remount R/O all filesystems and always
stops remounting the same device (on both servers this is a +64GB ext3
filesystem with quotas enabled).

+ Finally we have to reBoot

Version-Release number of selected component (if applicable):

Currently 2.4.21-9.0.3.ELsmp, but same behaviour with 2.4.21-9.0.1.ELsmp

How reproducible:

n/a

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

(1) n400:
      + Fujitsu-Siemens PRIMERGY N400, 4 x Pentium III XEON 700MHz
      + 4 GB RAM
      + 4 x 36GB internal, SW RAID (md)
      + 8 x 36GB external, HW RAID on a Mylex AcceleRaid 352
      + Using LVM as volume manager, ext3 as filesystem with quotas
        enabled on user's filesystems.

(2) dl580:
      + HP Proliant DL580 G2, 4 x Intel Xeon MP 2GHz
      + 8GB RAM
      + 4 x 73GB internal HD, HW RAID on a SA 5i+
      + 8 x 146GB external HD, HW RAID on a SACS connected to a SA 532
      + Using LVM as volume manager, ext3 as filesystem with quotas
        enabled on user's filesystems.

Comment 1 Juanjo Villaplana 2004-04-30 10:38:40 UTC

Created attachment 99813 [details]
Server 1 (n400) freeze #3

Console output for SysRq request of frozen server 1.

We don't have the output associated to the previous two freezes.

Comment 2 Juanjo Villaplana 2004-04-30 11:24:57 UTC

Created attachment 99814 [details]
Server 2 (dl580) freeze #4

Console output for SysRq request of frozen server 2.

Comment 3 Juanjo Villaplana 2004-04-30 11:27:04 UTC

Created attachment 99815 [details]
Server 2 (dl580) freeze #5

Console output for SysRq request of frozen server 2.

Comment 4 Juanjo Villaplana 2004-04-30 11:35:19 UTC

Created attachment 99816 [details]
Server 2 (dl580) freeze #7

Console output for SysRq request of frozen server 2.

Note: on this console output showTasks has been called twice, before and after
trying to kill tasks (tErm, kIll).

Comment 6 Rik van Riel 2004-04-30 15:15:42 UTC

OK, looking at it some more I see a potential problem:

1) irqbalance opens /proc/interrupts, which ends up doing a GFP_KERNEL
allocation from interrupts_open()

2) kswapd dives into the filesystem code to free memory (in order to
satisfy the allocation)

Al, could this result in a locking problem?
Would it be better if interrupts_open() did its allocation with GFP_NOFS ?

Comment 7 Alexander Viro 2004-04-30 16:09:03 UTC

What locking problem?  Caller of interrupts_open() is not holding
any locks; for all practical purposes we are talking about
allocation in sys_open() and it's _definitely_ allowed to make
GFP_KERNEL allocations.

Comment 8 Rik van Riel 2004-04-30 17:03:43 UTC

Al, thanks for confirming that that's not the issue here. I'll take
another look at the traces to see if there's anything else suspicious...

Comment 9 Larry Woodman 2004-04-30 17:39:36 UTC

Juanjo, can you get the RHEL3-U2/update 2 kernel and try it?  I think
this problem has been fixed.  In the mean time, please try
"echo 30 > /proc/sys/vm/inactive_clean_percent" and re-run the workload.

Larry Woodman

Comment 10 Juanjo Villaplana 2004-05-03 08:47:25 UTC

Larry, as far as I know RHEL3-U2 is still in beta and
kernel-smp-2.4.21-14.EL.i686.rpm is vulnerable to RHSA-2004:183. As
these servers allow interactive user access (students, faculty, staff
...) we can't run a vulnerable kernel.

If all we need from the RHEL3-U2 kernel is the new
"vm.inactive_clean_percent = 30" default, we can put this value in
sysctl.conf until RHEL3-U2 is released or a patched kernel is available.

Anyway I configured "vm.inactive_clean_percent = 30" last friday on
both servers.
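
Persisting the value in sysctl.conf, as described above, could be scripted roughly as follows. This is a minimal sketch and not part of the original report: `set_sysctl_conf` is a hypothetical helper, and the file path is passed as an argument so that it can be tried on a scratch copy before touching /etc/sysctl.conf.

```shell
# Hypothetical helper: set "KEY = VALUE" in a sysctl.conf-style file,
# replacing any earlier entry for the same key.
set_sysctl_conf() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    # Escape dots so that "vm.foo" is matched literally by grep -E.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Copy every line except existing assignments of this key.
    grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" 2>/dev/null || true
    # Append the new assignment.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, `set_sysctl_conf vm.inactive_clean_percent 30 /etc/sysctl.conf` followed by `sysctl -p` would be one way to apply it. The tunable itself belongs to the RHEL 3 kernels discussed here and may not exist on other kernels.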

Unfortunately, server 2 froze last Saturday (*), but the scenario changed
slightly, maybe due to the new inactive_clean_percent setting:

+ The server stops responding to all services with no apparent
degradation. It still returns ping requests and responds to sysrq, and
agetty on the serial console is able to spawn login; I could nearly log
in as root (user/password accepted, motd shown, but no shell prompt).

+ We can showMem, showPc, showTasks, shoWcpus

+ tErm was able to kill all user processes.

+ Then we logged in as root, took some additional info and rebooted
the server. Unfortunately the server stuck on a rc script, issued a
tErm again and finally the server got absolutely frozen (no ping, no
sysrq ... nothing).


(*) Since the server 2 upgrade, it has not been up for more than 6
days (indeed most freezes occur after exactly 6 days of uptime) and it was
restarted April 25th.

Comment 11 Juanjo Villaplana 2004-05-03 08:51:33 UTC

Created attachment 99899 [details]
Server 2 (dl580) freeze #8   (with comments)

Server 2 (dl580) freeze #8

grep "^COMMENT:" for added comments to the console output

Comment 12 Juanjo Villaplana 2004-05-03 11:56:18 UTC

Created attachment 99907 [details]
Server 2 (dl580) memory usage until freeze #8

This is a modified "sar -r" output where the last column computes real memory
usage (in MB) as: memused - buffers - cached
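
The last-column calculation described above can be sketched in shell. The function name is hypothetical, and the sketch assumes the three inputs are in kB (as in the kbmemused, kbbuffers and kbcached columns of `sar -r`); the actual script used to produce the attachment is not shown in this report.

```shell
# Hypothetical helper: memory in MB that is neither buffers nor page cache.
# Arguments are in kB, in this order: kbmemused kbbuffers kbcached.
real_used_mb() {
    echo $(( ($1 - $2 - $3) / 1024 ))
}
```

The division is integer division, so the result is rounded down to whole megabytes.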

You'll see how this value increases from 04/25/2004 09:50:00 AM (server
reboot) until 05/01/2004 07:57:13 PM (server freeze #8). We think this memory
consumption does not correspond to user process usage, because this server
has a regular load due to e-mail (SMTP, POP and IMAP) and a variable load
(interactive sessions, database sessions, smb disk access ...) that starts at
8:00 and ends at 22:00.

See also "free" and "ps -alfy" output in attachment "Server 2 (dl580) freeze #8
(with comments)", where the sum of RSS of all processes is far below the real
memory usage. We guess the kernel is using this memory, but that seems like too
much memory usage to us.

Please note that "vm.inactive_clean_percent = 30" was set on 04/30/2004
10:30:00 PM approximately. We also have full sar statistics if more information
is required.

Comment 13 Juanjo Villaplana 2004-05-04 07:29:31 UTC

Created attachment 99943 [details]
showMem for loopback interface troubles

Comment 14 Juanjo Villaplana 2004-05-04 08:45:30 UTC

Comment on attachment 99943 [details]
showMem for loopback interface troubles

On April 22nd we detected problems with CUPS and SMTP (postfix+amavis
sandwich). At first glance they seemed like two independent issues, although
both services use 127.0.0.1 for process communication (lp-cupsd,
postfix-amavis).

After some testing we detected that cups could print files <15KB (approx), but
'lp' always locked up when trying to print bigger files.

Stracing 'lp' we noticed it was blocked receiving from 'cups' just after
sending the file being printed (look for strace output in the next attachment).

As the default MTU for interface 'lo' is 16436 and a 15KB file fits in a single
packet and a 16KB file (plus TCP/IP headers) doesn't, we tested with different
values, and noticed that lowering the lo MTU to 1500 solved the problem.

We don't know if this problem is related to the server freezes, but today it
has shown up again on server 2 (it never happened on server 1) and, again, setting
the lo MTU to 1500 has worked. We haven't rebooted the server, so if more testing
is needed or if a new bug has to be opened for this issue please let us know.

Regards.

Comment 15 Juanjo Villaplana 2004-05-04 08:47:59 UTC

Created attachment 99949 [details]
'lp' strace output for loopback interface troubles

Comment 16 Larry Woodman 2004-05-04 14:50:34 UTC

Juanjo, what is running on this system?  The reason I ask is because
the memory allocation for lowmem looks rather strange, all of the
lowmem has been allocated to the pagecache and none has been allocated
to anonymous memory regions.  This is very unusual.

( Active: 440307/257888, inactive_laundry: 38729, inactive_clean: 38827, free: 1161926 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2942
  aa:0 ac:109614 id:20627 il:3071 ic:3134 fr:1440
  aa:62489 ac:268204 id:237261 il:35658 ic:35693 fr:1157544


Larry Woodman

Comment 17 Larry Woodman 2004-05-04 15:06:47 UTC

Ah!!!, I know what the problem is here and I already fixed it in
RHEL3-U2.  We were not properly casting to an unsigned char
within an if statement, and that resulted in this exact system hang.

We need to get you running RHEL3-U2 asap!

Larry Woodman


******** patch to rebalance_laundry_zone that fixed the hang *********
-if (now - page->age > 30) {
+if ((unsigned char)(now - page->age) > 30) {
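
The effect of the missing cast can be illustrated outside the kernel. The sketch below assumes, hypothetically, that both `now` and `page->age` are 8-bit counters that wrap at 256 and that the subtraction is carried out in a wider signed type; the report does not show the declarations, so this only illustrates the wraparound, using shell arithmetic rather than kernel code.

```shell
# Difference in a wide signed type: goes negative once "now" has wrapped.
# Arguments: now age
elapsed_plain() {
    echo $(( $1 - $2 ))
}

# Difference truncated to 8 bits, mimicking the (unsigned char) cast.
# Arguments: now age
elapsed_u8() {
    echo $(( ($1 - $2) & 0xFF ))
}

# Succeed if the page counts as older than 30 ticks under each rule.
older_than_30_plain() { [ "$(elapsed_plain "$1" "$2")" -gt 30 ]; }
older_than_30_u8()    { [ "$(elapsed_u8 "$1" "$2")" -gt 30 ]; }
```

With now=40 and age=250 (46 ticks apart across a wrap), `elapsed_plain` gives -210, so the plain comparison does not fire, while `elapsed_u8` gives 46 and the comparison does.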


********************** hang traceback without this fix ***************
[<c0133db5>] schedule_timeout [kernel] 0x65 (0xc9083e38)
[<c0133d40>] process_timeout [kernel] 0x0 (0xc9083e58)
[<c0145b88>] wait_on_page_timeout [kernel] 0xa8 (0xc9083e70)
[<c0165b07>] try_to_free_buffers [kernel] 0x147 (0xc9083e94)
[<c0152ad8>] rebalance_laundry_zone [kernel] 0x218 (0xc9083eac)
[<c0155b1d>] __alloc_pages [kernel] 0x28d (0xc9083edc)
[<c0155c5c>] __get_free_pages [kernel] 0x1c (0xc9083f20)
[<c0125adf>] dup_task_struct [kernel] 0x5f (0xc9083f24)
[<c012642b>] copy_process [kernel] 0x7b (0xc9083f38)
[<c021da30>] sock_map_fd [kernel] 0x70 (0xc9083f40)
[<c0126f4e>] do_fork [kernel] 0x4e (0xc9083f68)
[<c0109d09>] sys_clone [kernel] 0x49 (0xc9083fa0)

Comment 18 Juanjo Villaplana 2004-05-04 16:16:57 UTC

Hi Larry, is there a chance to patch our current kernel
(2.4.21-9.0.3.ELsmp) with rebalance_laundry_zone patch, or to patch 
RHEL3-U2 (kernel-smp-2.4.21-14.EL) with ip_setsockopt security patch?

Both issues are very important for us.

Thanks.

Comment 19 Larry Woodman 2004-05-04 19:56:46 UTC

The very latest RHEL3-U2 kernel contains both patches you need.

Larry

Comment 20 Ernie Petrides 2004-05-04 20:13:36 UTC

Hello, Juanjo.  I just want to confirm that the latest RHEL3 U2
kernel, which is version 2.4.21-15.EL, contains the same security
errata fix to ip_setsockopt() that was released in RHSA-2004:183
(kernel version 2.4.21-9.0.3.EL), which you refer to in comments #10
and #18.  We intend to officially release the RHEL3 U2 errata next
week, and its advisory id is RHSA-2004:188.  (It's currently at the
end of the external beta testing period.)

Cheers.  -ernie

Comment 21 Juanjo Villaplana 2004-05-05 07:01:33 UTC

Hi Larry, Ernie. Thanks for your fast response, but I am unable to
find neither kernel 2.4.21-15.EL nor advisory RHSA-2004:188 at
rhn.redhat.com.

The current kernel on "Red Hat Enterprise Linux AS (v. 3 for x86)
Beta" is 2.4.21-14.EL I guess we will have to wait until you
officially release the RHEL3 U2 errata, is this true?

Best regards, Juanjo.

Comment 22 Ernie Petrides 2004-05-05 20:43:14 UTC

Hi, Juanjo.  I have just verified with our RHN team that the -15.EL
kernel (respun 2 weeks ago for the security issue) was intentionally
not pushed into the beta channel (for obscure process reasons).  But
it is scheduled to be pushed into the main update channel on Monday,
after which time you should be able to upgrade via RHSA-2004:188.

If you are unable to access RHSA-2004:188 by Tuesday, May 11th,
please feel free to contact me for a status update.

Cheers.  -ernie

Comment 23 John Flanagan 2004-05-12 01:08:50 UTC

An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-188.html

Comment 24 Juanjo Villaplana 2004-05-20 07:13:11 UTC

Created attachment 100368 [details]
Server 2 (dl580) freeze #9 with latest kernel

Hi,

We are reopening this bug because after upgrading our servers to RHEL AS 3
Update 2 we have experienced again the problems that led us to open this bug.

Server 2 (dl580) was running 2.4.21-15.ELsmp since May 13th and this server has
frozen after 7 days of uptime (note that 7 days was the usual uptime between
freezes with the previous kernel) with the same symptoms (only responds to
ping and sysrq). You will find attached the console output for the sysrq
requests (Memory, Tasks, Umount ...).


As we noted previously, this server was running RHL 9 until march 25 2004 and I
don't know if this is relevant but with RHL9 we experienced similar periodic
freezes (in that case we were able to kill all tasks from sysrq, log in the
system and reboot) and the workaround was to use a PAE-disabled kernel, so we
switched from kernel-bigmem to kernel-smp and got a rock-solid server at the
expense of losing 4GB.

With RHEL 3 kernel-smp already has PAE enabled so we can't test the same
workaround, but we are thinking about modifying kernel-2.4.21-15.EL.src.rpm to
generate an additional kernel-smp4g package with CONFIG_HIGHMEM4G=y instead of
CONFIG_HIGHMEM64G=y.

What do you think about that?


Best regards

Comment 27 Larry Woodman 2004-06-17 14:55:04 UTC

This is not a memory problem, none of the zones are all that badly
depleted.  Please get me several AltSysrq T, P, W and M outputs when
the system is in this state so I can see exactly what the processes
are stuck on.

Larry Woodman

Comment 28 Juanjo Villaplana 2004-06-18 07:40:20 UTC

Created attachment 101240 [details]
Server 2 (dl580) freeze #10 with latest kernel

Console output for SysRq request of frozen server 2 (05/25/2004).

The server had 5 days of uptime.

Comment 29 Juanjo Villaplana 2004-06-18 07:44:31 UTC

Created attachment 101241 [details]
Server 1 (n400) freeze #4

Server 1 (n400) freeze #4 with kernel 2.4.21-15.ELsmp.

Console output for SysRq request of frozen server 1 (14/06/2004).

The server had 27 days of uptime.

Comment 30 Juanjo Villaplana 2004-06-18 07:48:39 UTC

Hi Larry,

I have attached the console output for the two last freezes. As usual,
you will find AltSysrq T, P, W and M before and after AltSysrq E and I.

Please tell me if this is not what you need.

Regards,
              Juanjo

Comment 31 Juanjo Villaplana 2004-06-18 10:40:29 UTC

Created attachment 101243 [details]
Server 2 (dl580) freeze #11 with PAE-disabled kernel

Console output for SysRq request of frozen server 2 (06/18/2004).

You will find two AltSysrq T, P, W and M before AltSysrq I, and two after.

The server had 2 days of uptime and was running a modified version of
2.4.21-15.ELsmp with CONFIG_HIGHMEM4G=y. It had previously run this
kernel for 18 days with no problems (it was rebooted for maintenance reasons).


Prior to the freeze, the server had a load average of 24.318.

Now the server is running the latest errata kernel (2.4.21-15.0.2.ELsmp, not
modified).

Comment 32 EZ 2004-06-23 12:33:43 UTC

I am having the same problems with RH9.0 kernel 2.4.20-31.9 bigmem.  
Running on Compaq DL380's.  Seems to occur when we get high ftp 
uploads or high user processing.

papaz

Comment 33 Juanjo Villaplana 2004-06-24 10:10:22 UTC

Created attachment 101370 [details]
Server 2 (dl580) freeze #12 with latest kernel

Console output for SysRq request of frozen server 2 (06/24/2004).

You will find several AltSysrq T, P, W and M.

The server had 6 days of uptime and was running kernel 2.4.21-15.0.2.ELsmp.

Comment 34 Cesar B 2004-06-28 14:55:27 UTC

Hi, unfortunately I have a similar case: two Compaq Proliant ML-350 G3
with Smart Array 64xx (mirrored disks), under Linux Enterprise Server
3.0. Our systems freeze at intervals of 2, 8, ? days
for no apparent reason. I do not run the "hpasm" daemons, only the
required services (smtp, xinetd, ssh, ftp, poppassd).
I upgraded to kernel 2.4.21-15.0.2.ELsmp, the latest released at
http://www.redhat.com/security/ , and upgraded the ROM flash components;
nevertheless the problem persists (excuse my bad English).

At the moment we are contacting Red Hat to look for a solution.

I would appreciate any comments you can offer.

Thanks, César.

Comment 35 Eric Whiting 2004-07-16 22:20:58 UTC

I have several of the HP DL380G3 boxes with 6G RAM.

These boxes seem to hang with PAE enabled kernel (2.4.18-26.7.xbigmem,
or custom 2.6.7).

These boxes run fine with 2.4.18-26.7.xsmp or 2.4.21-15.ELsmp.  (which
limits the box to 4G of physical RAM usable by the kernel)

I can usually trigger a hang in 5-20 minutes with: while(true);do make
clean;make -j4 bzImage;done in a kernel tree.

Comment 36 Juanjo Villaplana 2004-07-22 06:58:04 UTC

Created attachment 102132 [details]
Server 2 (dl580) freeze #13 with kernel 2.4.21-15.0.2

Hi Larry,

Please find attached the console output for SysRq request of frozen server 2
(07/22/2004).

The server had 30 days of uptime and was running kernel 2.4.21-15.0.2.ELsmp,
note that this time the server has been up for a month, this may be due to the
fact that students are on vacation and the server has lower load.

Did the latest console outputs with several AltSysrq, help you to see what the
processes are stuck on? If you need more info please tell us.

Best regards,
		 Juanjo

Comment 37 Eric Whiting 2004-07-22 14:18:14 UTC

Correction: In #35 I stated that our DL380G3 boxes ran fine with
2.4.21-15.ELsmp. I have since discovered that they also hang running
this kernel.  It takes 2-10 hours to hang, but the box will almost
always go down in less than 1 day.

Comment 38 Eric Whiting 2004-07-23 16:08:38 UTC

I have had 8 crashes (and reboots from the HP PSP watchdog) in the
last 12 hours.

OS: RHEL3.0U2 
HW: HP DL380G3

I caught one of the oops/panic reports before the watchdog rebooted
the box...

kernel BUG at journal.c:406!
invalid operand: 0000
soundcore cpqasm cpqevt lp parport autofs audit 8021q bcm5700 floppy
sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd
cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<f884b9fa>]    Tainted: P
EFLAGS: 00010286

EIP is at journal_write_metadata_buffer [jbd] 0x38a (2.4.21-15.ELsmp/i686)
eax: 00000068   ebx: f4347390   ecx: 00000001   edx: c037ae94
esi: 00000000   edi: f7ec3540   ebp: 0000000d   esp: f6071e38
ds: 0068   es: 0068   ss: 0068
Process kjournald (pid: 23, stackpage=f6071000)
Stack: f884f17c f884dfc4 f884dffe 00000196 f884dfe2 f884bbe8 00000000 00000000
       f4347390 00000000 f7ec3540 0000000d f8848ad9 f2cab500 f4347390 f6071e98
       00001d69 f60701c0 f6ba0a94 00000000 00000f64 f2c2009c 00000011 f2cab500
Call Trace:   [<f884f17c>] .rodata.str1.4 [jbd] 0xf48 (0xf6071e38)
[<f884dfc4>] .rodata.str1.1 [jbd] 0x544 (0xf6071e3c)
[<f884dffe>] .rodata.str1.1 [jbd] 0x57e (0xf6071e40)
[<f884dfe2>] .rodata.str1.1 [jbd] 0x562 (0xf6071e48)
[<f884bbe8>] journal_next_log_block [jbd] 0x48 (0xf6071e4c)
[<f8848ad9>] journal_commit_transaction [jbd] 0xed9 (0xf6071e68)
[<c0109c6c>] __switch_to [kernel] 0x16c (0xf6071ef8)
[<c0125194>] context_switch [kernel] 0xa4 (0xf6071f20)
[<c0123274>] schedule [kernel] 0x2f4 (0xf6071f3c)
[<f884b51a>] kjournald [jbd] 0x17a (0xf6071fb0)
[<f884b380>] commit_timeout [jbd] 0x0 (0xf6071fd4)
[<f884b3a0>] kjournald [jbd] 0x0 (0xf6071fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf6071ff0)

Code: 0f 0b 96 01 fe df 84 f8 e9 a9 fc ff ff 89 f6 8d bc 27 00 00

Kernel panic: Fatal exception
Unable to handle kernel NULL pointer dereference at virtual address 00000008
printing eip:
c0142e10
*pde = 31b88001
*pte = 00000000

Comment 39 Almir Pollnow 2004-07-28 18:11:10 UTC

We have a Box with:
Brand: HP
Model: DL580 G2
CPU: 4 x Intel Xeon 2GHz
Memory: 6GB
Disks: 4 x 140GB SmartArray 642 Controller configured Array 5

Operating System: RedHat Enterprise Linux AS Version 3 with 2.4.21-15.0.3.ELsmp Kernel
Main Application: Oracle Database Server version Oracle9i Release 9.2.0.5.0

The server has been running since June 8, 2004, and we have had many freezes
(almost one per day).
This bug (Bug # 122077) seems to be the same problem we have. All
updates recommended by Red Hat Network were applied; however, it still freezes
and only responds to ping.

If you need more information I can send it!

Almir Alcides Pollnow
Network Administrator
TEKA S.A.
++55 47 3215132
Brazil

Comment 40 Larry Woodman 2004-07-30 18:05:05 UTC

I see what the problem is here.  One process ends up calling getdqbuf
indirectly via open when disk quotas are in use.  getdqbuf() downs the
dqio_sem semaphore and attempts to allocate a dqbuf.  The allocation
calls wakeup_kswapd() and blocks because the system is very low on
memory.  kswapd then wakes up, calls dqput() indirectly through
prune_icache and downs the same dqio_sem semaphore.  At this point the
system is deadlocked!

This patch will prevent wakeup_kswapd from blocking, therefore the
system will not deadlock.
********************************************************************
--- linux-2.4.21/fs/quota_v2.c.orig     2004-07-30 13:31:56.000000000 -0400
+++ linux-2.4.21/fs/quota_v2.c  2004-07-30 13:32:12.000000000 -0400
@@ -128,7 +128,7 @@
 
 static dqbuf_t getdqbuf(void)
 {
-       dqbuf_t buf = kmalloc(V2_DQBLKSIZE, GFP_KERNEL);
+       dqbuf_t buf = kmalloc(V2_DQBLKSIZE, GFP_NOFS);
        if (!buf)
                printk(KERN_WARNING "VFS: Not enough memory for quota buffers.\n");
        return buf;
**********************************************************************

The kernel with this fix can be downloaded from here:

http://people.redhat.com/~lwoodman/.RHEL3/

Please test out the kernel and let me know how it goes.

Larry Woodman

Comment 41 Juanjo Villaplana 2004-08-02 12:08:04 UTC

Hi Larry,

I have reviewed the attached SysRq console outputs and I have found
the scenario you describe (a client process ---local, useradd, imap,
quota, etc--- calling getdqbuf() and kswapd calling dqput()) in
almost(*) all server freezes.

I will update both servers to 2.4.21-18.dq this week, but the load on
these servers will be very low until mid September and, with this
load, we had only one freeze in the past 30 days ...

Best regards,
                  Juanjo

(*) As stated in comment #17, "Server 2 (dl580) freeze #8" was a
different problem already fixed in U2.

Comment 42 Juanjo Villaplana 2004-08-03 07:54:28 UTC

Created attachment 102382 [details]
Server 2 (dl580) freeze #14 with kernel 2.4.21-15.0.3

Hi Larry,

Please find attached the console output for SysRq request of frozen server 2
(08/03/2004).

The server had 11 days of uptime and was running kernel 2.4.21-15.0.3.ELsmp,
I have found the getdqbuf() / dqput() scenario in the SysRq task list.

Both servers are now running kernel 2.4.21-18.dq.ELsmp.

Larry, can you make available the associated "kernel-source" and
"kernel-hugemem" ?

I need the first one to recompile fujitsu-siemens agents for server 1, and the
second package to test it in a NFS client affected by bugzilla #118839 ... if
this is not possible I will try to patch kernel-2.4.21-18.EL available from
"Red Hat Enterprise Linux AS (v. 3 for x86) Beta" channel.

Best regards,
		 Juanjo

Comment 43 Larry Woodman 2004-08-03 21:17:19 UTC

The builds you requested are underway, I'll put them in my people page
location as soon as they are complete and update this bug.

Larry

Comment 44 Larry Woodman 2004-08-04 14:19:20 UTC

All set Juanjo, however I moved the kernels for you to avoid any
confusion.  Please grab them from here:

http://people.redhat.com/~lwoodman/.bug122077/


Larry Woodman

Comment 45 Juanjo Villaplana 2004-08-05 16:39:38 UTC

Created attachment 102464 [details]
Server 2 (dl580) freeze #14 with kernel 2.4.21-18.dq

Hi Larry,

Please find attached the console output for SysRq request of frozen server 2
(08/05/2004).

The server had 2 days of uptime and was running kernel 2.4.21-18.dq.ELsmp,
this freeze is very different (with respect to getdqbuf related freezes),
existing interactive sessions appeared to be responsive but unable to execute
any command, SysRq tErm killed some processes, and SysRq Unmount successfully
remounted all filesystems and almost all tasks look like this:

Call Trace:   [<c0123e14>] schedule [kernel] 0x2f4 (0xf5959e48)
[<c010adb3>] __down [kernel] 0x73 (0xf5959e8c)
[<c010af5c>] __down_failed [kernel] 0x8 (0xf5959ec0)
[<c0175c86>] .text.lock.namei [kernel] 0x35 (0xf5959ed0)
[<c017204c>] link_path_walk [kernel] 0x45c (0xf5959ef0)
[<c0172599>] path_lookup [kernel] 0x39 (0xf5959f30)
[<c0172b5e>] open_namei [kernel] 0x7e (0xf5959f40)
[<c0162333>] filp_open [kernel] 0x43 (0xf5959f70)
[<c0162763>] sys_open [kernel] 0x53 (0xf5959fa8)

I have switched this server to the latest stable errata kernel,
2.4.21-15.0.4.ELsmp and, if you think is a good idea, I will try to patch this
kernel with your .dq. patch and update both servers to this patched kernel.

Best regards,
		 Juanjo

Comment 46 Larry Woodman 2004-08-05 18:00:22 UTC

Juanjo, can you let the system get back into the above state and get
me one AltSysrq-T followed by one AltSysrq-M.  There is so much
"stuff" in the attachment that I am having trouble determining which
tracebacks are  in the same AltSysrq-T and which are duplicates.

Thanks, Larry

Comment 47 Juanjo Villaplana 2004-08-05 19:05:21 UTC

Larry, I don't see the problem ...

.... in the attached file there are other AltSysrq than "T" and "M",
but you can find an AltSysrq-M on lines 84 to 107 and an AltSysrq-T on
lines 163 to 4208 (I have counted 393 tasks, note that there is a lot
of "chkpwdd" because is a daemon that forks to process each incoming
request, the same applies to "xinetd" ---used for pop[s] and
imap[s]--), on line 4211 there is an AltSysrq-E, on lines 4214 to 4236
there is other AltSysrq-M and on lines 4303 to 7579 (now there are
only 294 tasks and 236 of them have "link_path_walk" in their "Call
Trace" I don't know if it is related to the hang).
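
Counts like the 236 tasks with link_path_walk can be approximated with grep. This is a rough sketch: `count_trace_hits` is a hypothetical helper, it assumes trace lines in the `[<address>] symbol [module] offset (stack)` form seen in this report, and it counts matching lines rather than tasks, so a task whose trace names the symbol twice is counted twice.

```shell
# Hypothetical helper: count call-trace lines that name SYMBOL in LOG.
count_trace_hits() {
    symbol=$1
    log=$2
    # -F treats the pattern as a fixed string; the surrounding "] " and " ["
    # delimit the symbol so that a longer name containing it is not matched.
    grep -c -F "] ${symbol} [" "$log"
}
```

For example, `count_trace_hits link_path_walk console.log`, where console.log stands for a saved copy of one AltSysrq-T section of the console output.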

If you want I can attach those AltSysrq as independent files ...

Regarding getting the system back into this state, please tell me if
it is absolutely necessary; remember that this is a production server
(I have switched to 2.4.21-15.0.4 because it is a security errata kernel).

Regards,
               Juanjo

Comment 48 Calvin Chuang 2004-08-13 17:26:14 UTC

Hi, Larry,

Have you put the related kernel source on your people page?  I need 
it to build the lpfcdd driver.  If you do, it will be a great help.

Thanks!
Calvin

Comment 49 Larry Woodman 2004-08-13 17:37:36 UTC

Sorry for the delay on this Calvin, I thought I did put the src.rpm
there!  It's there now, please let me know how this works for you, I
need confirmation that it does fix your problem.  Now Juanjo is
hitting another deadlock which I haven't figured out yet.  Please let
me know if you hit that one as well.

http://people.redhat.com/~lwoodman/.bug122077/kernel-2.4.21-18.dq.EL.src.rpm

Thanks, Larry Woodman

Comment 50 Juanjo Villaplana 2004-08-15 18:30:06 UTC

Created attachment 102749 [details]
Server 1 (n400) freeze #5 with kernel 2.4.21-15.0.4.dq

Hi Larry,

Please find attached the console output for SysRq requests (T+M+U+B) of frozen
server 1 (08/14/2004).

Due to the release of security errata kernel 2.4.21-15.0.4, I switched both
servers to this kernel, then I patched this kernel with your 'dq' patch and
switched both servers to 2.4.21-15.0.4.dq.ELsmp.

Server 1 was hit by a deadlock after 3 days of uptime.

Best regards,
		 Juanjo

Comment 51 Calvin Chuang 2004-08-17 16:19:31 UTC

Larry,

I applied the source code from
http://people.redhat.com/~lwoodman/.bug122077/kernel-2.4.21-18.dq.EL.src.rpm
and built modules from that source base.  When I
tried to install the modules, it came back with a kernel version
mismatch error and failed to install.  Any suggestions? (My kernel is
2.4.21-18.dq.ELsmp)  Thanks!

Best Regards,
Calvin

Comment 52 Ernie Petrides 2004-09-10 01:06:38 UTC

A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.4.EL).

Comment 53 Enrico Ardizzoni 2004-09-13 11:32:30 UTC

Hi All,

Our system:

DL380G3 (latest firmware) + RHEL ES3-U3 with kernel 2.4.21-20.ELsmp

Periodic freezes, every 12-18 days, with all previous versions of RHEL: only
ping works and there are no error messages...

This morning I was able to freeze (many times) the system by hand
using memtester:

 http://www.qcc.ca/~charlesc/software/memtester/

I use memtester from DAG:

 http://dag.wieers.com/packages/memtester/

Command used (we have 2.5GB RAM):

# memtester 2310

Memtester tries to mlock() memory and then the system freezes...

I hope this could help.

Comment 54 Juanjo Villaplana 2004-09-13 16:46:53 UTC

Hi Ernie,

I can't find kernel version 2.4.21-20.4.EL in RHN, could you make it
available? I am very interested in it because it may spare me from patching
2.4.21-20.EL with the dq patch ...

Best regards,
                Juanjo

Comment 55 Ernie Petrides 2004-09-13 20:32:30 UTC

Hello, Juanjo.  The 2.4.21-20.4.EL kernel will never be available
via RHN because it is an internal-to-Red-Hat Engineering build.
The fix will eventually be released as part of Update 4, which
is currently anticipated in December of this year.

Comment 56 Juanjo Villaplana 2004-09-16 11:28:53 UTC

Created attachment 103903 [details]
Server 2 (dl580) freeze #16 with kernel 2.4.21-20.dq.EL

Hi Larry,

This morning we updated server 2 to RHEL3 U3 with a 2.4.21-20 kernel with the dq
patch. After 4 hours of uptime we detected the following problems on the
server:

    + Very high load average (80 when we were alerted about the problem).

    + Unable to open new ssh/console sessions.

    + On the existing sessions, some commands worked fine, but others (like
top, ps and w) ran very slowly, though they could be interrupted via ctrl+c.

Please find attached the AltSysrq-T and AltSysrq-M console output at the moment
we detected the problems.

Stopping almost all services didn't help to recover the server, and our attempt
to "reboot" was unsuccessful, so we issued an AltSysrq-U (all filesystems were
successfully remounted R/O) followed by an AltSysrq-B to successfully reboot the
server (I can attach the full console output if it helps).

We have rebooted the server with the standard 2.4.21-20.EL kernel.

Regards,
	   Juanjo
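
The SysRq keys used in the comment above can be summarised in a small lookup table. This is a minimal sketch: the descriptions are the usual meanings of these keys on Linux, not text taken from this report, and the helper name `describe` is hypothetical.

```python
# SysRq keys mentioned in the comment above, with their usual meaning.
# The descriptions are general Linux behaviour, not quoted from this report.
SYSRQ_ACTIONS = {
    "T": "dump the state of all tasks to the console",
    "M": "dump memory usage information to the console",
    "U": "remount all mounted filesystems read-only",
    "B": "reboot immediately, without syncing or unmounting",
}


def describe(sequence):
    """Return one readable line per SysRq key, in the order given."""
    lines = []
    for key in sequence:
        action = SYSRQ_ACTIONS.get(key.upper())
        if action is None:
            raise KeyError(f"no description recorded for SysRq-{key}")
        lines.append(f"AltSysrq-{key.upper()}: {action}")
    return lines


# Order used above: capture diagnostics first (T, M), then recover (U, B).
RECOVERY_SEQUENCE = ["T", "M", "U", "B"]

for line in describe(RECOVERY_SEQUENCE):
    print(line)
```

Issuing U before B matters because B does not sync or unmount anything on its own, so remounting read-only first reduces the risk of filesystem damage.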

Comment 57 Larry Woodman 2004-09-16 14:27:55 UTC

Hi Juanjo, the problem is that several of the processes are blocked in
wakeup_kswapd() when they shouldn't be.  This is due to a bug in
wakeup_kswapd() that unfortunately wasn't discovered
until RHEL3-U3 was out.  Can you please try the appropriate kernel
with the fix?  It's located in:

http://people.redhat.com/~lwoodman/.RHEL3/


Thanks, Larry Woodman

Comment 58 Juanjo Villaplana 2004-09-16 18:37:11 UTC

Hi Larry,

Could you also provide the .src.rpm? We also need to patch it with
your 'dq' patch (if it isn't already included in 2.4.21-20.6)...

Regards,
               Juanjo

Comment 59 Juanjo Villaplana 2004-09-16 18:43:12 UTC

Sorry Larry, I forgot to ask you if this bug is related to the amount
of RAM installed on the server. We have other servers with 4GB or less
that have been running 2.4.21-20 for several days without problems ...
Should we upgrade these servers to 2.4.21-20.6 too?

Regards,
                Juanjo

Comment 60 Ernie Petrides 2004-09-16 20:34:47 UTC

Reverting to MODIFIED state, since the bug fix is in the U4 pool.

Comment 62 Juanjo Villaplana 2004-09-20 20:57:15 UTC

Created attachment 104030 [details]
Server 2 (dl580) panic #1 with kernel 2.4.21-20.6.EL

Hi Larry,

After 36 hours of uptime with kernel 2.4.21-20.6 we got the attached panic
on server 2.

We have switched to 2.4.21-15.0.4.dq.

Regards,
	    Juanjo

Comment 63 Ernie Petrides 2004-09-20 23:16:24 UTC

Hello, Juanjo.  Please open up a new bugzilla for the oops you just
documented in comment #62.  This bug has already been used to track
two separate problems (one fixed in U3, and the other committed to U4).

Thanks.

(Note to Larry: the changes you made to do_try_to_free_pages() were
 originally committed to -20.6.EL, and later updated in -20.7.EL, but
 I don't know whether this might be related to this latest kscand oops.)

Comment 64 Larry Woodman 2004-09-21 14:19:03 UTC

Juanjo, I already fixed this panic.  Please grab the appropriate
kernel from here and rerun the test ASAP:

>>>http://people.redhat.com/~lwoodman/.RHEL3/


Thanks, Larry

Comment 65 Ernie Petrides 2004-09-22 01:26:28 UTC

Reverting to MODIFIED state, again.

Comment 66 Juanjo Villaplana 2004-09-28 19:06:36 UTC

Hi Ernie,

I have opened Bug #133971 to document the kernel oops.

Regards,
               Juanjo

Comment 67 Ernie Petrides 2004-09-29 02:26:01 UTC

Thanks, Juanjo.  Larry is on the case.

Comment 68 Juanjo Villaplana 2004-09-29 15:00:20 UTC

Created attachment 104509 [details]
Server 1 (n400) freeze #6 with kernel 2.4.21-20.dq.EL

Comment 69 Juanjo Villaplana 2004-09-29 15:02:32 UTC

Created attachment 104510 [details]
Server 1 (n400) freeze #7 with kernel 2.4.21-20.dq.EL

Comment 70 Juanjo Villaplana 2004-09-29 15:19:29 UTC

Hi Larry,

Please find attached to Comments #68 & #69 the console output for the
SysRq requests (T+M) of frozen server 1 (on 09/27/2004 and
09/28/2004). The scenario was similar to the one documented in Comment
#56, but I attached them to be sure it isn't a different issue.

Best regards,
                  Juanjo

Comment 71 Paulo Vilhena 2004-12-01 13:26:52 UTC

Hi Larry,

I have 2 servers:
 HP DL380 with 2 x Intel Xeon 3 GHz processors 
 6 GB of memory 
 RH3-Up3 (2.4.21-20.ELsmp)
 Storage EMC Clariion CX700 using Qlogic QLA2340 fibre
Software:
 Oracle RAC (9.2.0.5.0)
 Oracle 9i (9.2.0.5.0)
 Oracle Collaboration Suite 9.0.4.1.0


I have the same symptom of a hang after 20 hours running: ping ok, 
sysrq ok, but no login and no log messages.

Is there any solution to the problem?

Thanks

Paulo Vilhena

Comment 72 Larry Woodman 2004-12-01 15:10:00 UTC

Juanjo, I believe the "freeze" problem that you were seeing was fixed
in RHEL3-U4, specifically in kernel-2.4.21-20.6.EL by the patch for the
incorrect blocking bug inside wakeup_kswapd().  Can you confirm this so we
can close out this bug?

Paulo, as far as your freeze after 20 hours is concerned, the most
likely cause is the "Storage EMC Clariion CX700 using Qlogic QLA2340
fibre".  This has caused multiple unrelated hangs on other systems. 
Can you grab me AltSysrq-M, AltSysrq-W and AltSysrq-T outputs when the
system gets into that state so I can verify my suspicions?

Larry Woodman

Comment 73 Juanjo Villaplana 2004-12-02 07:03:47 UTC

Hi Larry,

Both servers are running 2.4.21-25.EL from U4 Beta and all the
problems reported on this bug seem solved, so you can close it.

Best regards,
                Juanjo

Comment 74 Paulo Vilhena 2004-12-03 10:36:50 UTC

Larry,

I'm trying to generate the requested information for you.

Thanks


Paulo Vilhena

Comment 75 Paulo Vilhena 2004-12-03 10:41:28 UTC

Larry,

One question...

If the problem is "Clariion+QLA2340", what do you suggest?

I'm using the native RH driver, not the EMC driver.

Thanks

Paulo Vilhena

Comment 76 Ernie Petrides 2004-12-03 21:43:42 UTC

Reverting to MODIFIED state, again.

Comment 77 EZ 2004-12-04 05:26:13 UTC

Where can I get the 2.4.21-25.EL kernel from U4 Beta that Juanjo says works for 
him?  These server freezes are getting out of hand.  Thanks.

EZ

Comment 78 Ernie Petrides 2004-12-06 23:29:17 UTC

It's in the arch-dependent beta channel on RHN, e.g., the one named
"Red Hat Enterprise Linux AS (v. 3 for x86) Beta".  We anticipate
that U4 will be released next week.  The final kernel version is
2.4.21-27.EL.
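
A quick way to check whether a given kernel release is at least the U4 final version mentioned above is to compare the numeric fields of the release string. This is a simplified sketch, not RPM's actual version comparison: it ignores tags such as "dq", "EL" and "smp", so it cannot tell a locally patched build from a stock one. The function names are hypothetical.

```python
# Simplified comparison of kernel release strings such as "2.4.21-20.6.EL".
# Assumption: only the leading numeric fields matter; this is NOT rpm's
# real comparison algorithm and it ignores tags like "dq", "EL" or "smp".

U4_FINAL = "2.4.21-27.EL"  # final U4 kernel version named in the comment above


def release_key(release: str):
    """Turn a release string into a sortable tuple of numeric fields."""
    version, _, build = release.partition("-")
    version_nums = tuple(int(part) for part in version.split("."))
    build_nums = []
    for part in build.split("."):
        if not part.isdigit():
            break  # stop at the first non-numeric tag, e.g. "dq" or "EL"
        build_nums.append(int(part))
    return version_nums, tuple(build_nums)


def is_at_least(running: str, baseline: str = U4_FINAL) -> bool:
    """True if `running` sorts at or above `baseline`."""
    return release_key(running) >= release_key(baseline)


for release in ["2.4.21-20.ELsmp", "2.4.21-25.EL", "2.4.21-27.ELsmp"]:
    status = "at or above" if is_at_least(release) else "below"
    print(f"{release}: {status} {U4_FINAL}")
```

The same helper can be pointed at an intermediate build by passing a different baseline, for example `is_at_least("2.4.21-25.EL", "2.4.21-20.6.EL")`.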

Comment 79 EZ 2004-12-07 13:01:31 UTC

Thank you Ernie.  I think we will wait until next week for U4 to come out and 
give that a shot.

EZ

Comment 80 Paulo Vilhena 2004-12-10 12:07:53 UTC

Created attachment 108309 [details]
Alt+SysRq logs of Oberon server - Paulo Vilhena

Comment 81 Paulo Vilhena 2004-12-10 12:10:48 UTC

Hi Larry,

On December 1 you asked me for the Alt+SysRq logs, to verify whether 
the freeze of my server is related to "EMC Clariion+QLA2340".

The logs are uploaded.

Thanks

Paulo Vilhena

Comment 82 Carlos Antonio Gomez 2004-12-15 16:52:31 UTC

Hi Larry:

I have Red Hat Enterprise Linux AS release 3 (Taroon Update 3)
with kernel 2.4.21-26.EL on an i686, and my server hangs after varying 
periods of stability but always with the same symptoms:
 - all daemons die and only the kernel stays up (ping, SysRq, ...)

This problem happens (95% of the time) when I send an email to the 
mailing list (the server runs postfix + mailman and all members are local).

I attach the SysRq (w,t,m) logs.

Regards

Comment 83 Carlos Antonio Gomez 2004-12-15 17:02:03 UTC

Created attachment 108631 [details]
Sysrq Logs

Sysrq logs (w,t,m)

Comment 84 Carlos Antonio Gomez 2004-12-16 14:51:56 UTC

Created attachment 108699 [details]
top 

top logs

Comment 85 John Flanagan 2004-12-20 20:55:04 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html